From landman at scalableinformatics.com Sun Nov 1 19:25:46 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Sun, 01 Nov 2009 22:25:46 -0500 Subject: [Beowulf] Storage recommendations? In-Reply-To: <1256916391.6856.225.camel@moelwyn.maths.qmul.ac.uk> References: <1256916391.6856.225.camel@moelwyn.maths.qmul.ac.uk> Message-ID: <4AEE513A.6050504@scalableinformatics.com> Robert Horton wrote: > Hi, > > I'm looking for some recommendations for a new "scratch" file server for > our cluster. Rough requirements are: > > - Around 20TB of storage > - Good performance with multiple nfs writes (it's quite a mixed workload > so hard to characterise further) > - Data security not massively important as it's just for scratch / > temporary data. > > It'll just be a single server serving nfs, I'm not looking to go down > the Lustre / PVFS route. a) what network fabric (IB, 10GbE, GbE, ...) b) roughly how many simultaneous writers ... large block streaming, or small block random? How much sustained IO (MB/s) do you need to support your worker machines? c) looking to build it yourself or buy units that work? -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From robh at dongle.org.uk Mon Nov 2 06:03:35 2009 From: robh at dongle.org.uk (Robert Horton) Date: Mon, 02 Nov 2009 14:03:35 +0000 Subject: [Beowulf] Storage recommendations? In-Reply-To: <4AEE513A.6050504@scalableinformatics.com> References: <1256916391.6856.225.camel@moelwyn.maths.qmul.ac.uk> <4AEE513A.6050504@scalableinformatics.com> Message-ID: <1257170615.6802.51.camel@moelwyn.maths.qmul.ac.uk> On Sun, 2009-11-01 at 22:25 -0500, Joe Landman wrote: > Robert Horton wrote: > > Hi, > > > > I'm looking for some recommendations for a new "scratch" file server for > > our cluster. Rough requirements are: > > > > - Around 20TB of storage > > - Good performance with multiple nfs writes (it's quite a mixed workload > > so hard to characterise further) > > - Data security not massively important as it's just for scratch / > > temporary data. > > > > It'll just be a single server serving nfs, I'm not looking to go down > > the Lustre / PVFS route. > > > a) what network fabric (IB, 10GbE, GbE, ...) Currently using the GigE network for the NFS traffic. There is a DDR InfiniBand network in place for the MPI traffic which could potentially be used for storage, however my impression is that the current bottleneck is the disk io (or iops) rather than the network, so I'm not sure that this would be worth the mucking about. > b) roughly how many simultaneous writers ... large block streaming, or > small block random? How much sustained IO (MB/s) do you need to support > your worker machines? The jobs typically use 32 nodes with 8 processes on each node, so I guess ~250. Averaged over 5 minutes, maximum IO is around 15MB/s read and 0.5MB/s for write, however I've seen short periods of around 30MB/s write. The problem, basically, is that under certain heavy IO conditions which I haven't managed to reliably reproduce, the whole machine becomes very slow, in some cases making interactive work pretty much impossible. Hence the desire to move the storage off the headnode. > > c) looking to build it yourself or buy units that work? > Preferably the second option, although I've not completely ruled out building something. 
From prentice at ias.edu Tue Nov 3 09:09:07 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Tue, 03 Nov 2009 12:09:07 -0500 Subject: [Beowulf] Fortran Array size question Message-ID: <4AF063B3.3050508@ias.edu> This question is a bit off-topic, but since it involves Fortran minutia, I figured this would be the best place to ask. This code may eventually run on my cluster, so it's not completely off topic! Question: What is the maximum number of elements you can have in a double-precision array in Fortran? I have someone creating a 4-dimensional double-precision array. When they increase the dimensions of the array to ~200 million elements, they get this error: compilation aborted (code 1). I'm sure they're hitting a Fortran limit, but I need to prove it. I haven't been able to find anything using The Google. -- Prentice From eugen at leitl.org Tue Nov 3 09:37:17 2009 From: eugen at leitl.org (Eugen Leitl) Date: Tue, 3 Nov 2009 18:37:17 +0100 Subject: [Beowulf] A look at the 100-core Tilera Gx Message-ID: <20091103173717.GT17686@leitl.org> http://www.semiaccurate.com/2009/10/29/look-100-core-tilera-gx/ A look at the 100-core Tilera Gx It's all about the network(s) by Charlie Demerjian October 29, 2009 TILERA IS CLAIMING to have the first commercial CPU to reach 100 cores, and while this is true, the real interesting technology is in the interconnects. The overall chip is quite a marvel, and it is unlike any mainstream CPU you have ever heard of. Making a lot of cores on a chip isn't very hard. Larrabee for example has 32 Pentium (P54) cores, heavily modified, as the basis of the GPU. If Intel wanted to, it could put hundreds of cores on a die, that part is actually quite easy. Keeping those cores fed is the most important problem of modern chipmaking, and that part is not easy. Large caches, wide memory busses, ring busses on chip, stacking, and optical interfaces all are attempts to feed the beast. Everyone thought Intel's Polaris, also known as the 80 core, 1 TeraFLOPS part from a few years ago, was about packing cores onto a die. It wasn't, it was a test of routing algorithms and structures. Routing is where the action is now, packing cores in is not a big deal. Routing is where Tilera shines. It has put a great deal of thought into getting data from core to core with minimal latency and problems. Its rather unique approach involves five different interconnect networks, programmable partitioning, accelerators, and simply tons of I/O. Together, these allow Tilera's third generation Tile-Gx CPUs to scale from 16 to 100 cores without choking on congestion. They may not have the same single-threaded performance of a Nehalem or Shanghai core, but they make up for it with volume. [Diagram: Tilera 100-core chip] The basic structure is a square array of small cores, 4x4, 6x6, 8x8 or 10x10, each connected via five (5) on-chip networks, and flanked by some very interesting accelerators. The cores themselves are a proprietary 32-bit ISA in the first two generations of Tilera chips, and in the Gx, it is extended to 64-bit. There are 75 new instructions in the Gx, 20 of which are SIMD, and the memory controller now sees 64 bits as well. In previous generations, there was no floating-point (FP) hardware in Tilera products. The company strongly recommended against using FP code because it had to be emulated taking hundreds or thousands of cycles. With the new Gx series chips, FP code is still frowned upon, but there is some FP hardware to catch the odd instruction without a huge speed hit. 
The 100 core part can do 50 GigaFLOPS of FP, which may sound like a large number, but that is only about 1/50th of what an ATI Cypress HD5870 chip can do. The majority of the new instructions are aimed at what the Tilera chips do best, integer calculations. Things like shuffle and DSP-like multiply-and-accumulate (MAC) functions, including a quad MAC unit, are where these new chips shine. Basically, the Gx moves information around very quickly while twiddling bits here and there with integer functions. While the cores might not be overly complex, the on-chip busses are. Each Gx core has 64K of L1 cache, 32K data and 32K instruction, along with a unified 8-way 256KB L2 cache. The cache is totally non-blocking, completely coherent, and the cache subsystem can reorder requests to other caches or DRAM. On top of this, the core supports cache pinning to keep often used data or instructions in cache. On the 100 core model, the Gx has 32MB of cache. Tile is the name Tilera uses for a basic unit of repetition. The 16 core Gx has 16 tiles, the 64 core Gx has 64, etc. A tile consists of a core, the L1 and L2 caches, and something Tilera calls the Terabit Switch. More than anything, this switch is the heart of the chip. [Diagram: a Tilera tile] Remember when we said that cramming 100 cores on a die is not a big problem, but feeding them is? The Terabit Switch is how Tilera solves the problem, and it is a rather unique solution. Instead of one off-core bus, there are five. Each of them has a dedicated purpose, and that not only gives huge bandwidth, it also goes a fair way towards minimizing contention. Cache traffic will never be stepped on by user data, and so on. The five networks are called QDN, RDN, FDN, IDN and UDN. In the last two generations of Tilera chips, all of these networks were 32 bits wide, but on the Gx, the widths vary to give each one more or less bandwidth depending on their functions. QDN is called the reQuest Dynamic Network, and it is used for memory and cache. QDN is 64 bits wide. RDN is Response Dynamic Network, and it is used to feed memory reads back to the chips. RDN is 112 bits wide, an odd number, 64 + 48 from the look of it. FDN is the widest at 128 bits, and it is used for cache to cache transfers and cache coherency. Given the critical nature of cache transactions like this, the width is no surprise. The last two, IDN and UDN, are both 32 bits wide. IDN is I/O Dynamic Network, and passes data on and off the chip. With a dedicated channel for off-chip transfers, you can see that reaching theoretical numbers was a priority at Tilera. The last network, UDN, is the User Dynamic Network, basically the one users get to send stuff around on. QDN, RDN, FDN and IDN are basically housekeeping, they work in the background. If you want to send things from point A to point B, you send it across the UDN. Although Tilera didn't explicitly state it, each hop from router to router takes one cycle. This means that in a pathological case, corner core to memory on the far corner, it could take 19 cycles to go from request to memory, plus the memory round trip time, and then another 19 cycles to get back. That is what you call a long time in computer speak. Even in an 'average' case, you have a 10 cycle latency, which is very long as well. To be fair, the Tilera architecture is not made to run general purpose code. 
As it was described when the first generation came out, workloads are meant to be chunked up, so a single tile does a function, then the data gets passed to the next tile for more work, and so on and so forth. If your program has 20 steps, you use 20 tiles and pipeline the work. This solves many of the problems with variable latency and multi-hop traffic. The other more elegant solution is the ability to section off chunks of the chip into sub-units. There is a hypervisor that can partition each Gx chip into programmable blocks. [Diagram: sub-sections of tiles] As you can see in the diagram above, each Gx is broken up into sub-chips in software. You can give each process as much CPU power as it needs, and arrange it so the output of one block feeds into the input of the next in a single clock. This example has two Apache web server instances, an intrusion prevention system (IPS), a secure sockets layer (SSL) stack, a network stack and a few other processes running next to each other. The Apache instances have their own memory controller, as do the IPS and the SSL stack. The network stack is sitting on top of the memory controller for decreased latency. Basically, the programmer can choose where to put each process to minimize latency. It doesn't take much to figure out how to apply these concepts to a database plus web server scenario, or a three-tiered SAP-like workload. Basically, Tilera allows you to explicitly place the data and compute resources where, when and how you need them. The chunks are done at roughly the same level as hardware VMs are in x86 CPUs, running below the level that a process can affect. This creates hardware walls to segregate data transfers, cache coherency traffic, and other tile to tile transfers. If done correctly, it can minimize latency a lot in addition to keeping processes from stepping on each other. Now that you know how the cores work, talk, and are partitioned, what about the 'uncore'? Talk about that starts with the memory controllers - four DDR3-2133MHz banks on the 64 and 100 core Gx, two on the 16 and 36 core models. For the keen eyed out there, this means Tilera has two different socket configurations, one for the 64 and 100 core chips, and another one for the 16 and 36 core chips. DDR3-2133MHz memory is very fast, hugely fast in fact. The math says roughly 17GB/s per controller (2133 MT/s x 8 bytes). Basically, this chip has a lot of available bandwidth. As you might imagine, on the 16 and 36 core variants, there are only half the controllers, so half the bandwidth. In addition, you have a generic controller for USB, UARTs, JTAG and I2C controllers. Given that Tilera chips are basically embedded, these are not likely to be used for much more than booting and diagnostics. On the core diagram above, there are two other blocks, the orange MiCA and mPIPE accelerators. These are where the other parts of the Tilera Gx 'magic' happen. MiCA stands for Multistream iMesh Crypto Accelerator, while mPIPE is short for multicore Programmable Intelligent Packet Engine. If it isn't blindingly obvious, the MiCA does the crypto and the mPIPE speeds up I/O. The mPIPE does a lot of interesting things, all supposedly at wire speed. It has a programmable packet classification engine, said to be usable at 80Gbps or 120M packets per second. It can twiddle headers and do other evil things that would make Comcast drool with the potential for 'network management' extortion payments. 
In addition, it can also load balance across the various I/O lanes, and redirect tile to tile 'I/O' in a somewhat intelligent fashion. On top of that, the mPIPE manages buffer sizes, queues, and other housekeeping to keep latencies low. Think of it as a programmable housekeeping offload engine. The most interesting bit is that the mPIPE can tag a packet with a 32 bit header before it sends it onto the internal network. This is where the programmable part shines. You can set up fields in the I/O packet itself to pass along pre-decode information and other time-saving tidbits. Since I/O is fully virtualizable, you could theoretically tag the packets with VM data, or just about anything else a bored programmer can think of. The MiCA engines, two on the 64/100 core, one on 16/36 cores, are crypto offload engines. They can work either 'inline' or as full-blown offload engines; that is up to the programmer. The MiCA can pull data directly from caches or main memory without CPU overhead, basically fire and forget. If you like acronyms, the MiCA on the Gx can support AES, 3DES, ARC4, Kasumi and Snow for crypto, SHA-1, SHA-2, MD5, HMAC and AES-GMAC for hashes, RSA, DSA, Diffie-Hellman, and Elliptic Curve for public key work, and it has a true random number generator (RNG). WTF, LOL, ROFL and other netspeak can be encrypted along with any other text that uses correct grammar. RLY. Tilera claims that the MiCA engine can do wire speed 40Gbps crypto with full duplex on the 100 core Gx, and 1024b key RSA at 50K keys per second on the 100 core, 20K keys per second for the 36 core. Not bad at all. In addition, the MiCA supports a hardware compression engine that uses the tried and true Deflate algorithm. The last piece of the puzzle is something that Tilera calls external acceleration interfaces. This could be as simple as plugging in a PCIe card, but that lacks elegance. The interesting part is a field programmable gate array (FPGA) interface. You can take up to 8 lanes of PCIe and connect the FPGA to the serializer/deserializer (SerDes) unit to enable basically direct and low latency 32Gbps transfers. Direct transfers to cache and multiple contexts are supported, meaning you can do quite a bit with an FPGA and a Tilera-Gx chip. In the end, you have a monster chip for I/O and packet processing. It doesn't do single-threaded applications all that fast, but it really isn't meant to. The chip itself is not out yet, nor is there even silicon yet. The first version out will be the 36 core Gx in Q4 of 2010, followed by the 16 core later in Q4 or possibly Q1 of 2011. These both share the same socket configuration and a 35*35mm package. In Q1 of 2011, the 100 core chip will come out on a new socket and in a 45*45mm package. A bit after that, the 64 core will hit the market. Power ranges from 10W for the 16 core to 55W for the 100 core, but you can get power optimized variants that will only suck 35W. Given the programmability of the parts, power use is likely more dependent on the programs running on it. The last bit of information is clock speeds. The 64 and 100 core models will come in versions that run at 1.25GHz and 1.5GHz, not bad considering how much there is to synchronize and keep going. The 36 core models will come in 1.0GHz, 1.25GHz and 1.5GHz versions, and the 16 core models will only come in 1.0GHz or 1.25GHz versions. 
Given the core count, internal interconnections, memory and I/O capabilities, Tilera will pack a lot of power into these small packages. S|A From kus at free.net Tue Nov 3 09:41:52 2009 From: kus at free.net (Mikhail Kuzminsky) Date: Tue, 03 Nov 2009 20:41:52 +0300 Subject: [Beowulf] Fortran Array size question In-Reply-To: <4AF063B3.3050508@ias.edu> Message-ID: In message from Prentice Bisbal (Tue, 03 Nov 2009 12:09:07 -0500): >This question is a bit off-topic, but since it involves Fortran >minutia, >I figured this would be the best place to ask. This code may >eventually >run on my cluster, so it's not completely off topic! > >Question: What is the maximum number of elements you can have in a >double-precision array in Fortran? I have someone creating a >4-dimensional double-precision array. When they increase the >dimensions >of the array to ~200 million elements, they get this error: > >compilation aborted (code 1). > >I'm sure they're hitting a Fortran limit, but I need to prove it. I >haven't been able to find anything using The Google. It is not a Fortran restriction. It may be a compiler restriction. 64-bit ifort for EM64T allows you to use, for example, 400 million elements. Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow > >-- >Prentice >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin >Computing >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf > From prentice at ias.edu Tue Nov 3 10:17:02 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Tue, 03 Nov 2009 13:17:02 -0500 Subject: [Beowulf] Fortran Array size question In-Reply-To: References: Message-ID: <4AF0739E.8030700@ias.edu> Mikhail Kuzminsky wrote: > In message from Prentice Bisbal (Tue, 03 Nov 2009 > 12:09:07 -0500): >> This question is a bit off-topic, but since it involves Fortran minutia, >> I figured this would be the best place to ask. This code may eventually >> run on my cluster, so it's not completely off topic! >> >> Question: What is the maximum number of elements you can have in a >> double-precision array in Fortran? I have someone creating a >> 4-dimensional double-precision array. When they increase the dimensions >> of the array to ~200 million elements, they get this error: >> >> compilation aborted (code 1). >> >> I'm sure they're hitting a Fortran limit, but I need to prove it. I >> haven't been able to find anything using The Google. > > It is not a Fortran restriction. It may be a compiler restriction. > 64-bit ifort for EM64T allows you to use, for example, 400 million > elements. > That's exactly the compiler I'm using, and it's failing at ~200 million elements. I'm digging through the Intel documentation. Haven't found an answer yet. 
-- Prentice From prentice at ias.edu Tue Nov 3 10:25:33 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Tue, 03 Nov 2009 13:25:33 -0500 Subject: [Beowulf] A look at the 100-core Tilera Gx In-Reply-To: <20091103173717.GT17686@leitl.org> References: <20091103173717.GT17686@leitl.org> Message-ID: <4AF0759D.9070503@ias.edu> Eugen Leitl wrote: > http://www.semiaccurate.com/2009/10/29/look-100-core-tilera-gx/ > > A look at the 100-core Tilera Gx > > It's all about the network(s) > > by Charlie Demerjian > > October 29, 2009 > In previous generations, there was no floating-point (FP) hardware in Tilera > products. The company strongly recommended against using FP code because it > had to be emulated taking hundreds or thousands of cycles. With the new Gx > series chips, FP code is still frowned upon, but there is some FP hardware to > catch the odd instruction without a huge speed hit. The 100 core part can do > 50 GigaFLOPS of FP which may sound like a large number, but that is only > about 1/50th of what an ATI Cypress HD5870 chip can do. I imagine this short-coming will limit the Tilera Gx's value to most of HPC community. This doesn't even mention DP performance. -- Prentice From lindahl at pbm.com Tue Nov 3 10:39:27 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Tue, 3 Nov 2009 10:39:27 -0800 Subject: [Beowulf] Fortran Array size question In-Reply-To: <4AF0739E.8030700@ias.edu> References: <4AF0739E.8030700@ias.edu> Message-ID: <20091103183927.GC16399@bx9.net> On Tue, Nov 03, 2009 at 01:17:02PM -0500, Prentice Bisbal wrote: > That's exactly the compiler I'm using, and it's failing at ~200 million > elements. I'm digging through the Intel documentation. Haven't found an > answer yet. Your bug report was incomplete: it really matters if the array is automatic or not, or if it's initialized. -- greg From prentice at ias.edu Tue Nov 3 11:24:00 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Tue, 03 Nov 2009 14:24:00 -0500 Subject: [Beowulf] Fortran Array size question In-Reply-To: <20091103183927.GC16399@bx9.net> References: <4AF0739E.8030700@ias.edu> <20091103183927.GC16399@bx9.net> Message-ID: <4AF08350.8060107@ias.edu> Greg Lindahl wrote: > On Tue, Nov 03, 2009 at 01:17:02PM -0500, Prentice Bisbal wrote: > >> That's exactly the compiler I'm using, and it's failing at ~200 million >> elements. I'm digging through the Intel documentation. Haven't found an >> answer yet. > > Your bug report was incomplete: it really matters if the array is > automatic or not, or if it's initialized. > You're right - I should have included a code snippet. It's not my code, so I don't know if I can share all of it. Here's the line where the problem occurs: dimension vstore(1:4,0:4,5000000,2),fstore(0:4,5000000,2) If he reduces the 5000000 to a smaller number, it compiles. As shown, he gets this error: ifort adaptnew2.for ... ... compilation aborted for adaptnew2.for (code 1) The compiler is Intel's ifort 11.0.074 I'm not a Fortran programmer, so I'm a little out of my element here. If it was bash or perl, or even C/C++, that'd be a different story. 
-- Prentice From richard.walsh at comcast.net Tue Nov 3 12:01:32 2009 From: richard.walsh at comcast.net (richard.walsh at comcast.net) Date: Tue, 3 Nov 2009 20:01:32 +0000 (UTC) Subject: [Beowulf] Fortran Array size question In-Reply-To: <1391956104.3640921257278279065.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> Message-ID: <1225611180.3642631257278492593.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> >----- Original Message ----- >From: "Prentice Bisbal" >To: "Beowulf Mailing List" >Sent: Tuesday, November 3, 2009 1:24:00 PM GMT -06:00 US/Canada Central >Subject: Re: [Beowulf] Fortran Array size question > >Greg Lindahl wrote: >> On Tue, Nov 03, 2009 at 01:17:02PM -0500, Prentice Bisbal wrote: >> >>> That's exactly the compiler I'm using, and it's failing at ~200 million >>> elements. I'm digging through the Intel documentation. Haven't found an >>> answer yet. >> >> Your bug report was incomplete: it really matters if the array is >> automatic or not, or if it's initialized. >> >You're right - I should have included a code snippet. It's not my code, >so I don't know if I can share all of it. Here's the line where the >problem occurs: > >dimension vstore(1:4,0:4,5000000,2),fstore(0:4,5000000,2) > >If he reduces the 5000000 to a smaller number, it compiles. As shown, he >gets this error: > >ifort adaptnew2.for >... >... >compilation aborted for adaptnew2.for (code 1) Prentice, I do not think the Fortran standard limits the size of one dimension in an array, although you can have only 7 dimensions. This to me must be a limit internal to their compiler. There may be an environment variable to reset. I would try another Fortran (maybe gfortran) to see if you get similar behavior or find another (different) limit. Limits should really be operating system imposed based on the size of the address space. Intel says as much on the website describing their compiler. Regards, rbw Thrashing River Computing _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: From mathog at caltech.edu Tue Nov 3 12:05:05 2009 From: mathog at caltech.edu (David Mathog) Date: Tue, 03 Nov 2009 12:05:05 -0800 Subject: [Beowulf] Re:Fortran Array size question Message-ID: Prentice Bisbal wrote: > Question: What is the maximum number of elements you can have in a > double-precision array in Fortran? I have someone creating a > 4-dimensional double-precision array. When they increase the dimensions > of the array to ~200 million elements, they get this error: > > compilation aborted (code 1). The two things that come immediately to mind are: 1. The compiler ran out of memory. (In addition to the size of the memory in the machine, check ulimit.) 2. The compiler is trying to build the program with 32 bit pointers and it cannot address this array, or perhaps all memory accessed, with a pointer of that size. If that is the issue using 64 bit pointers should solve the problem, but I can't tell you what compiler switches are needed to do this. 
Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From atp at piskorski.com Tue Nov 3 12:27:53 2009 From: atp at piskorski.com (Andrew Piskorski) Date: Tue, 3 Nov 2009 15:27:53 -0500 Subject: [Beowulf] A look at the 100-core Tilera Gx In-Reply-To: <20091103173717.GT17686@leitl.org> References: <20091103173717.GT17686@leitl.org> Message-ID: <20091103202753.GA32198@piskorski.com> On Tue, Nov 03, 2009 at 06:37:17PM +0100, Eugen Leitl wrote: > > http://www.semiaccurate.com/2009/10/29/look-100-core-tilera-gx/ > A look at the 100-core Tilera Gx > It's all about the network(s) > by Charlie Demerjian It's ironic that while Tilera's own website points out their heritage from MIT's RAW project, these external magazine articles generally don't even mention it. For actual understanding of the technology it'd probably be more useful to point to and briefly summarize the extensive and well-written MIT research papers, and then explain what the company has actually changed since the academic work, and since Tilera was last in the news with product announcements two years ago (c. Oct. 2007). Btw, is anyone commercializing the (related technology) TRIPS Polymorphic Processor (EDGE architecture) work from the University of Texas? It sounded even more interesting and useful than RAW, but (not being a chip guy at all myself) I had no idea whether that was just hot air or not. http://groups.csail.mit.edu/cag/raw/ http://www.cs.utexas.edu/~trips/ http://www.beowulf.org/archive/2007-February/017414.html http://www.beowulf.org/pipermail/beowulf/2007-October/019617.html http://www.beowulf.org/pipermail/beowulf/2007-October/019621.html http://www.beowulf.org/pipermail/beowulf/2007-October/019677.html -- Andrew Piskorski http://www.piskorski.com/ From atp at piskorski.com Tue Nov 3 12:34:41 2009 From: atp at piskorski.com (Andrew Piskorski) Date: Tue, 3 Nov 2009 15:34:41 -0500 Subject: [Beowulf] A look at the 100-core Tilera Gx In-Reply-To: <4AF0759D.9070503@ias.edu> References: <4AF0759D.9070503@ias.edu> Message-ID: <20091103203441.GB32198@piskorski.com> On Tue, Nov 03, 2009 at 01:25:33PM -0500, Prentice Bisbal wrote: >> With the new Gx series chips, FP code is still frowned upon, but >> there is some FP hardware to catch the odd instruction without a >> huge speed hit. > I imagine this short-coming will limit the Tilera Gx's value to most of > HPC community. This doesn't even mention DP performance. I don't remember anything from the MIT RAW papers suggesting that the technology can't handle floating point, so I assume their integer-only focus was a business decision. If their business is successful, I imagine they'll offer a product intended for floating-point work some years down the road (if they're still around by then, of course). -- Andrew Piskorski http://www.piskorski.com/ From gus at ldeo.columbia.edu Tue Nov 3 12:38:36 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Tue, 03 Nov 2009 15:38:36 -0500 Subject: [Beowulf] Re:Fortran Array size question In-Reply-To: References: Message-ID: <4AF094CC.40805@ldeo.columbia.edu> Hi Prentice, list Intel Fortran (at least the 10. and 11.something versions I have) has different "memory models" for compilation. The default is "small". The PGI compiler has a similar feature, IIRR. Have you tried -mcmodel=medium or large? I never used large, but medium helped a few times on x86_64/i64em. Of course your available RAM may be restriction, as David pointed out. An excerpt from "man ifort" is enclosed below. 
I hope this helps, Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- From "man ifort": -mcmodel=<model> (i64em only; L*X only) Tells the compiler to use a specific memory model to generate code and store data. This option can affect code size and per- formance. You can specify one of the following values for <model>: * small Restricts code and data to the first 2GB of address space. All accesses of code and data can be done with Instruction Pointer (IP)-relative addressing. This is the default. * medium Restricts code to the first 2GB; it places no memory restric- tion on data. Accesses of code can be done with IP-relative addressing, but accesses of data must be done with absolute addressing. * large Places no memory restriction on code or data. All accesses of code and data must be done with absolute addressing. If your program has COMMON blocks and local data with a total size smaller than 2GB, -mcmodel=small is sufficient. COMMONs larger than 2GB require -mcmodel=medium or -mcmodel=large. Allocation of memory larger than 2GB can be done with any set- ting of -mcmodel. IP-relative addressing requires only 32 bits, whereas absolute addressing requires 64 bits. IP-relative addressing is somewhat faster. So, the small memory model has the least impact on per- formance. Note: When the medium or large memory models are specified, you must also specify option -shared-intel to ensure that the cor- rect dynamic versions of the Intel run-time libraries are used. When shared objects (.so files) are built, position-independent code (PIC) is specified so that a single .so file can support all three memory models. The compiler driver adds option -fpic to implement PIC. However, you must specify a memory model for code that is to be placed in a static library or code that will be linked stati- cally. David Mathog wrote: > Prentice Bisbal wrote: > >> Question: What is the maximum number of elements you can have in a >> double-precision array in Fortran? I have someone creating a >> 4-dimensional double-precision array. When they increase the dimensions >> of the array to ~200 million elements, they get this error: >> >> compilation aborted (code 1). > > The two things that come immediately to mind are: > > 1. The compiler ran out of memory. (In addition to the size of the > memory in the machine, check ulimit.) > > 2. The compiler is trying to build the program with 32 bit pointers and > it cannot address this array, or perhaps all memory accessed, with a > pointer of that size. If that is the issue using 64 bit pointers should > solve the problem, but I can't tell you what compiler switches are > needed to do this. 
> > Regards, > > David Mathog > mathog at caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From h-bugge at online.no Tue Nov 3 13:30:05 2009 From: h-bugge at online.no (Håkon Bugge) Date: Tue, 3 Nov 2009 22:30:05 +0100 Subject: [Beowulf] Fortran Array size question In-Reply-To: <4AF08350.8060107@ias.edu> References: <4AF0739E.8030700@ias.edu> <20091103183927.GC16399@bx9.net> <4AF08350.8060107@ias.edu> Message-ID: If it takes some time before it aborts, check the available size on your temp directory (usually /tmp). That might be the case if you're using IPO. Håkon On Nov 3, 2009, at 20:24 , Prentice Bisbal wrote: > Greg Lindahl wrote: >> On Tue, Nov 03, 2009 at 01:17:02PM -0500, Prentice Bisbal wrote: >> >>> That's exactly the compiler I'm using, and it's failing at ~200 >>> million >>> elements. I'm digging through the Intel documentation. Haven't >>> found an >>> answer yet. >> >> Your bug report was incomplete: it really matters if the array is >> automatic or not, or if it's initialized. >> > You're right - I should have included a code snippet. It's not my > code, > so I don't know if I can share all of it. Here's the line where the > problem occurs: > > dimension vstore(1:4,0:4,5000000,2),fstore(0:4,5000000,2) > > If he reduces the 5000000 to a smaller number, it compiles. As > shown, he > gets this error: > > ifort adaptnew2.for > ... > ... > compilation aborted for adaptnew2.for (code 1) > > The compiler is Intel's ifort 11.0.074 > > I'm not a Fortran programmer, so I'm a little out of my element > here. If > it was bash or perl, or even C/C++, that'd be a different story. > > -- > Prentice > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > Mvh., Håkon Bugge h-bugge at online.no +47 924 84 514 From Michael.Frese at NumerEx-LLC.com Tue Nov 3 17:02:02 2009 From: Michael.Frese at NumerEx-LLC.com (Michael H. Frese) Date: Tue, 03 Nov 2009 18:02:02 -0700 Subject: [Beowulf] Fortran Array size question In-Reply-To: <4AF08350.8060107@ias.edu> References: <4AF0739E.8030700@ias.edu> <20091103183927.GC16399@bx9.net> <4AF08350.8060107@ias.edu> Message-ID: <6.2.5.6.2.20091103175916.0973d718@NumerEx-LLC.com> I think Gus has it right. Your two arrays together come to about 250 million floating point words, 8 bytes per word, and therefore roughly 2 gigabytes total. That's too big for the default 'small' memory model. Certainly, there are no limits to array sizes in Fortran. Mike At 12:24 PM 11/3/2009, Prentice Bisbal wrote: >Greg Lindahl wrote: > > On Tue, Nov 03, 2009 at 01:17:02PM -0500, Prentice Bisbal wrote: > > > >> That's exactly the compiler I'm using, and it's failing at ~200 million > >> elements. I'm digging through the Intel documentation. Haven't found an > >> answer yet. > > > > Your bug report was incomplete: it really matters if the array is > > automatic or not, or if it's initialized. > > >You're right - I should have included a code snippet. It's not my code, >so I don't know if I can share all of it. 
Here's the line where the >problem occurs: > >dimension vstore(1:4,0:4,5000000,2),fstore(0:4,5000000,2) > >If he reduces the 5000000 to a smaller number, it compiles. As shown, he >gets this error: > >ifort adaptnew2.for >... >... >compilation aborted for adaptnew2.for (code 1) > >The compiler is Intel's ifort 11.0.074 > >I'm not a Fortran programmer, so I'm a little out of my element here. If >it was bash or perl, or even C/C++, that'd be a different story. > >-- >Prentice >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf From gerry.creager at tamu.edu Wed Nov 4 06:03:28 2009 From: gerry.creager at tamu.edu (Gerry Creager) Date: Wed, 04 Nov 2009 08:03:28 -0600 Subject: [Beowulf] A look at the 100-core Tilera Gx In-Reply-To: <20091103173717.GT17686@leitl.org> References: <20091103173717.GT17686@leitl.org> Message-ID: <4AF189B0.2070200@tamu.edu> I think it was the recent IEEE Spectrum, where they talk about using the Tilera 100-core chips for HPC, tuned to a specific problem using FPGA for optimizing the chips to the problem. The argument is to use a lower-power system with huge numbers of cores and efficient on-chip switching, to replace the common x86(_64) architecture we've come to know and hate for its energy consumption and heat generation. Personal take: conventional systems will win this battle (ask SiCortex: a great idea overwhelmed by investors who couldn't see its longer-term benefits), but we just might see changes to slower but more efficient cores. Via Epia-10k comes to mind, as do the Atom and several other variants. A little slower switching fabric (gigabit) with some changes to the core thinking of integration designers will be required, but I think we could make that 20kcore Atom system using gigabit work pretty well compared to a 4k core Nehalem with QDR. The big thing is reworking our thinking: It costs a LOT (we've said this all before) to create the power and cooling infrastructure for serious HPC, and I'll posit now that "serious" requires at least 4k x86_64 cores in today's logic. If the cost of powering and cooling all this stuff is considered, it's a huge expense, but then, a lot of us are at academic institutions, and don't have to consider infrastructure... or didn't until recently. Example: I have no place to expand our HPC, since we've maxed out power and cooling in the machine room we're currently in. And, in the only reasonable space I can build out to expand into, power's $90K and cooling another $100K to expand, allowing an additional 20 racks. Of x86_64 and QDR. In fact, while I'll gain 20 racks of space, I'm not sure I can get 20 racks of cooling in place for that. I'm reasonably sure I can power the stuff for the $90K figure and even add sufficient generator to keep critical elements (cooling at a reduced level; HPC generally has no requirement for running during a power failure) to continue until a clean shutdown or power's restored. I like what I've read on the Tilera. I think it's got some potential, but I think it's time we consider taking our breed of HPC toward the Maker side of things, and begin hacking minimalist motherboards, adopting low-power devices, and generally reinvent the hardware stack as we knew it. 
gerry Eugen Leitl wrote: > http://www.semiaccurate.com/2009/10/29/look-100-core-tilera-gx/ > > A look at the 100-core Tilera Gx > > It's all about the network(s) > > by Charlie Demerjian > > October 29, 2009 > > TILERA IS CLAIMING to have the first commercial CPU to reach 100 cores, and > while this is true, the real interesting technology is in the interconnects. > The overall chip is quite a marvel, and it is unlike any mainstream CPU you > have ever heard of. > > Making a lot of cores on a chip isn't very hard. Larrabee for example has 32 > Pentium (P54) cores, heavily modified, as the basis of the GPU. If Intel > wanted to, it could put hundreds of cores on a die, that part is actually > quite easy. Keeping those cores fed is the most important problem of modern > chipmaking, and that part is not easy. > > Large caches, wide memory busses, ring busses on chip, stacking, and optical > interfaces all are attempts to feed the beast. Everyone thought Intel's > Polaris, also known as the 80 core, 1 TeraFLOPS part from a few years ago, > was about packing cores onto a die. It wasn't, it was a test of routing > algorithms and structures. Routing is where the action is now, packing cores > in is not a big deal. > > Routing is where Tilera shines. It has put a great deal of thought into > getting data from core to core with minimal latency and problems. Its rather > unique approach involves five different interconnect networks, programmable > partitioning, accelerators, and simply tons of I/O. Together, these allow > Tilera's third generation Tile-Gx CPUs to scale from 16 to 100 cores without > choking on congestion. They may not have the same single-threaded performance > of a Nehalem or Shanghai core, but they make up for it with volume. > > 100 core diagram > > Tilera 100 core chip > > The basic structure is a square array of small cores, 4x4, 6x6, 8x8 or 10x10, > each connected via five (5) on-chip networks, and flanked by some very > interesting accelerators. The cores themselves are a proprietary 32-bit ISA > in the first two generations of Tilera chips, and in the Gx, it is extended > to 64-bit. There are 75 new instructions in the Gx, 20 of which are SIMD, and > the memory controller now sees 64 bits as well. > > In previous generations, there was no floating-point (FP) hardware in Tilera > products. The company strongly recommended against using FP code because it > had to be emulated taking hundreds or thousands of cycles. With the new Gx > series chips, FP code is still frowned upon, but there is some FP hardware to > catch the odd instruction without a huge speed hit. The 100 core part can do > 50 GigaFLOPS of FP which may sound like a large number, but that is only > about 1/50th of what an ATI Cypress HD5870 chip can do. > > The majority of the new instructions are aimed at what the Tilera chips do > best, integer calculations. Things like shuffle and DSP-like > multiply-and-accumulate (MAC) functions, including a quad MAC unit, are where > these new chips shine. Basically, the Gx moves information around very > quickly while twiddling bits here and there with integer functions. > > While the cores might not be overly complex, the on-chip busses are. Each Gx > core has 64K of L1 cache, 32K data and 32K instruction, along with a unified > 8-way 256KB L2 cache. The cache is totally non-blocking, completely coherent, > and the cache subsystem can reorder requests to other caches or DRAM. 
On top > of this, the core supports cache pinning to keep often used data or > instructions in cache. On the 100 core model, the Gx has 32MB of cache. > > Tiles are the name Tilera uses for for a basic unit of repetition. The 16 > core Gx has 16 tiles, the 64 core Gx has 64, etc. A tile consists of a core, > the L1 and L2 caches, and something Tilera calls the Terabit Switch. More > than anything, this switch is the heart of the chip. > > Tile diagram > > A Tilera tile > > Remember when we said that cramming 100 cores on a die is not a big problem, > but feeding them is? The Terabit Switch is how Tilera solves the problem, and > it is a rather unique solution. Instead of one off-core bus, there are five. > Each of them has a dedicated purpose, and that not only gives huge bandwidth, > it also goes a fair way towards minimizing contention. Cache traffic will > never be stepped on by user data, and so on. > > The five networks are called QDN, RDN, FDN, IDN and UDN. In the last two > generations of Tilera chips, all of these networks were 32 bits wide, but on > the Gx, the widths vary to give each one more or less bandwidth depending on > their functions. > > QDN is called the reQuest Dynamic Network, and it is used for memory and > cache. QDN is 64 bits wide. RDN is Response Dynamic Network, and it is used > to feed memory reads back to the chips. RDN is 112 bits wide, an odd number, > 64 + 48 from the look of it. > > FDN is the widest at 128 bits, and it is used for cache to cache transfers > and cache coherency. Given the critical nature of cache transactions like > this, the width is no surprise. The last two IDN and UDN are both 32 bits > wide. IDN is I/O Dunamic Network, and passes data on and off the chip. With a > dedicated channel for off-chip transfers, you can see that reaching > theoretical numbers was a priority at Tilera. > > The last network UDN is for User Dynamic Network, basically the one users get > to send stuff around on. QDN, RDN, FDN and IDN are basically housekeeping, > they work in the background. If you want to send things from point A to point > B, you send it across the UDN. > > Although Tilera didn't explicitly state it, each hop from router to router > takes one cycle. This means that in a pathological case, corner core to > memory on the far corner, it could take 19 cycles to go from request to > memory, plus the memory round trip time, and then another 19 cycles to get > back. That is what you call a long time in computer speak. Even in an > 'average' case, you have a 10 cycle latency, which is very long as well. > > To be fair, the Tilera architecture is not made to run general purpose code. > As it was described when the first generation came out, workloads are meant > to be chunked up, so a single tile does a function, then the data gets passed > to the next tile for more work, and so on and so forth. If your program has > 20 steps, you use 20 tiles and pipeline the work. > > This solves many of the problems with variable latency and multi-hop traffic. > The other more elegant solution is the ability to section off chunks of the > chip into sub-units. There is a hypervisor that can partition each Gx chip > into programmable blocks. > > Chunking tiles > > Sub-sections of tiles > > As you can see in the diagram above, each Gx is broken up into sub-chips in > software. You can give each process as much CPU power as it needs, and > arrange it so the output of one block feeds into the input of the next in a > single clock. 
This example has two Apache web server instances, an intrusion > prevention system (IPS), a secure sockets layer (SSL) stack, a network stack > and a few other processes running next to each other. > > The Apache instances have their own memory controller, as do the IPS and the > SSL stack. The network stack is sitting on top of the memory controller for > decreased latency. Basically, the programmer can choose where to put each > process to minimize latency. It doesn't take much to figure out how to apply > these concepts to a database plus web server scenario, or a three-tiered > SAP-like workload. > > Basically, Tilera allows you to explicitly place the data and compute > resources where, when and how you need them. The chunks are done at roughly > the same level as hardware VMs are in x86 CPUs, running below the level that > a process can affect. This creates hardware walls to segregate data > transfers, cache coherency traffic, and other tile to tile transfers. If done > correctly, it can minimize latency a lot in addition to keeping processes > from stepping on each other. > > Now that you know how the cores work, talk, and are partitioned, what about > the 'uncore'? Talk about that starts with the memory controllers - four > DDR3-2133MHz banks on the 64 and 100 core Gx, two on the 16 and 36 core > models. For the keen eyed out there, this means Tilera has two different > socket configurations, one for the 64 and 100 core chips, and another one for > the 16 and 36 core chips. > > DDR3-2133MHz memory is very fast, hugely fast in fact. The math says 17GBps > per contr. Basically, this chip has a lot of available bandwidth. As you > might imagine, on the 16 and 36 core variants, there are only half the > controllers, so half the bandwidth. > > In addition, you have a generic controller for USB, UARTs, JTAG and I2C > controllers. Given that Tilera chips are basically embedded, these are not > likely to be used for much more than booting and diagnostics. > > On the core diagram above, there are two other blocks, the orange MiCA and > mPIPE accelerators. These are where the other parts of the Tilera Gx 'magic' > happen. MiCA stands for Multistream iMesh Crypto Accelerator, while mPIPE is > short for multicore Programmable Intelligent Packet Engine. If it isn't > blindingly obvious, the MiCA does the crypto and the mPIPE speeds up I/O. > > The mPIPE does a lot of interesting things, all supposedly at wire speed. It > has a programmable packet classification engine, said to be usable at 80Gbps > or 120M packets per second. It can twiddle headers and do other evil things > that would make Comcast drool with the potential for 'network management' > extortion payements. > > In addition, it can also load balance across the various I/O lanes, and > redirect tile to tile 'I/O' in a somewhat intelligent fashion. On top of > that, the mPIPE manages buffer sizes, queues, and other housekeeping to keep > latencies low. Think of it as a programmable housekeeping offload engine. > > The most interesting bit is that the mPIPE can tag a packet with a 32 bit > header before it sends it onto the internal network. This is where the > programmable part shines. You can set up fields in the I/O packet itself to > pass along pre-decode information and other time-saving tidbits. Since I/O is > fully virtualizable, you could theoretically tag the packets with VM data, or > just about anything else a bored programmer can think of. 
> > The MiCA engines, two on the 64/100 core, one on 16/36 cores, are crypto > offload engines. They can work either 'inline' or as ull blown offload > engines, that is up to the programmer. The MiCA can pull data directly from > caches or main memory without CPU overhead, basically fire and forget. > > If you like acronyms, the MiCA on the Gx can support AES, 3DES, ARC4, Kasumi > and Snow for crypto, SHA-1, SHA-2, MD5, HMAC and AES-GMAC for hashes, RSA, > DSA, Diffie-Hellman, and Elliptic Curve for public key work, and it has a > true random number generator (RNG). WTF, LOL, ROFL and other netspeak can be > encrypted along with any other text that uses correct grammar. RLY. > > Tilera claims that the MiCA engine can do wire speed 40Gbps crypto with full > duplex on the 100 core Gx, and 1024b key RSA at 50K keys per second on the > 100 core, 20K keys per second for the 36 core. Not bad at all. In addition, > the MiCA supports a hardware compression engine that uses the tried and true > Deflate algorithm. > > The last piece of the puzzle is something that Tilera calls external > acceleration interfaces. This could be as simple as plugging in a PCIe card, > but that lacks elegance. The interesting part is a field programmable gate > array (FPGA) interface. You can take up to 8 lanes of PCIe and connect the > FPGA to the serial deserial unit (SerDes) to enable basically direct and low > latency 32Gbps transfers. Direct transfers to cache and multiple contexts are > supported, meaning you can do quite a bit with an FPGA and a Tilera-Gx chip. > > In the end, you have a monster chip for I/O and packet processing. It doesn't > do single-threaded applications all that fast, but it really isn't meant to. > The chip itself is not out yet, nor is there even silicon yet. The first > version out will be the 36 core Gx in Q4 of 2010, followed by the 16 core > later in Q4 or possibly Q1 of 2011. These both share the same socket > configuration and a 35*35mm package. > > In Q1 of 2011, the 100 core chip will come out on a new socket and in a > 45*45mm package. A bit after that, the 64 core will hit the market. Power > ranges from 10W for the 16 core to 55W for the 100 core, but you can get > power optimized variants that will only suck 35W. Given the programmability > of the parts, power use is likely more dependent on the programs running on > it. > > The last bit of information is clock speeds. The 64 and 100 core models will > come in versions that run at 1.25GHz and 1.5GHz, not bad considering how much > there is to synchronize and keep going. The 36 core models will come in > 1.0GHz, 1.25GHz and 1.5GHz versions, and the 16 core models will only come in > 1.0GHz or 1.25GHz versions. 
Given the core count, internal interconnections, > memory and I/O capabilities, Tilera will pack a lot of power into these small > packages. S|A > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at pbm.com Wed Nov 4 10:36:09 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Wed, 4 Nov 2009 10:36:09 -0800 Subject: [Beowulf] A look at the 100-core Tilera Gx In-Reply-To: <4AF189B0.2070200@tamu.edu> References: <20091103173717.GT17686@leitl.org> <4AF189B0.2070200@tamu.edu> Message-ID: <20091104183609.GA6626@bx9.net> On Wed, Nov 04, 2009 at 08:03:28AM -0600, Gerry Creager wrote: > And, > in the only reasonable space I can build out to expand into, power's $90K > and cooling another $100K to expand, allowing an additional 20 racks. Uh, I'm missing how this is a big problem... $200k or $300k of capital costs to get 20 more racks. Let's assume that you unfairly have to pay the whole capital cost up front. How much does the equipment to fill those racks cost? A lot more than $200k, unless you have a really low power density, or are buying nodes with small ram, no high speed network, etc. So it's annoying, but not a show-stopper. If you had to build a new building or addition, yeah, that would really hurt. This is the fundamental problem that low power startups face. They have a huge advantage when the machineroom is of a fixed capacity, or if capital costs aren't accounted for properly. They have a modest advantage for organizations that can capitalize things. The first market isn't big enough, and the big market in which they only have a modest advantage won't pay enough of a premium. Game over. -- greg From cbergstrom at pathscale.com Tue Nov 3 17:27:22 2009 From: cbergstrom at pathscale.com ("C. Bergström") Date: Tue, 03 Nov 2009 20:27:22 -0500 Subject: [Beowulf] Fortran Array size question In-Reply-To: <6.2.5.6.2.20091103175916.0973d718@NumerEx-LLC.com> References: <4AF0739E.8030700@ias.edu> <20091103183927.GC16399@bx9.net> <4AF08350.8060107@ias.edu> <6.2.5.6.2.20091103175916.0973d718@NumerEx-LLC.com> Message-ID: <4AF0D87A.60501@pathscale.com> Michael H. Frese wrote: > I think Gus has it right. Your two arrays together come to about 250 > million floating point words, 8 bytes per word, and therefore roughly > 2 gigabytes total. > > That's too big for the default 'small' memory model. > > Certainly, there are no limits to array sizes in Fortran. If you need a 64bit Fortran compiler ping me off list. (PathScale not iFort of course) I was working with some extremely large code a few months ago and one of our engineers could probably polish up my patch and send out test binaries to anyone interested. All we would ask for in this specific case is feedback. Our performance on floating point code should be better than Intel's. 
If you find to the contrary that's a bug we're happy to address ./C From niftyompi at niftyegg.com Wed Nov 4 16:36:20 2009 From: niftyompi at niftyegg.com (Nifty Tom Mitchell) Date: Wed, 4 Nov 2009 16:36:20 -0800 Subject: [Beowulf] Fortran Array size question In-Reply-To: <4AF0D87A.60501@pathscale.com> References: <4AF0739E.8030700@ias.edu> <20091103183927.GC16399@bx9.net> <4AF08350.8060107@ias.edu> <6.2.5.6.2.20091103175916.0973d718@NumerEx-LLC.com> <4AF0D87A.60501@pathscale.com> Message-ID: <20091105003620.GA2166@hpegg.wr.niftyegg.com> On Tue, Nov 03, 2009 at 08:27:22PM -0500, "C. Bergström" wrote: > Michael H. Frese wrote: > >I think Gus has it right. Your two arrays together come to about 250 > >million floating point words, 8 bytes per word, and therefore roughly > >2 gigabytes total. > > > >That's too big for the default 'small' memory model. > > > >Certainly, there are no limits to array sizes in Fortran. > If you need a 64bit Fortran compiler ping me off list. (PathScale > not iFort of course) I was working with some extremely large code a > few months ago and one of our engineers could probably polish up my > patch and send out test binaries to anyone interested. All we would > ask for in this specific case is feedback. > > Our performance on floating point code should be better > than Intel's. If you find to the contrary that's a bug we're happy > to address If I recall the pathscale compiler also needs a flag to establish the memory model at compile time. One thing I cannot answer off the top of my head is if there is a need to also establish the memory model for libraries. Since not all libs are the same, some will and some will not. If there is any doubt when debugging, check that the compile flags are consistent. From prentice at ias.edu Thu Nov 5 10:33:07 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Thu, 05 Nov 2009 13:33:07 -0500 Subject: [Beowulf] Fortran Array size question In-Reply-To: <4AF063B3.3050508@ias.edu> References: <4AF063B3.3050508@ias.edu> Message-ID: <4AF31A63.7040803@ias.edu> Prentice Bisbal wrote: > This question is a bit off-topic, but since it involves Fortran minutia, > I figured this would be the best place to ask. This code may eventually > run on my cluster, so it's not completely off topic! > > Question: What is the maximum number of elements you can have in a > double-precision array in Fortran? I have someone creating a > 4-dimensional double-precision array. When they increase the dimensions > of the array to ~200 million elements, they get this error: > > compilation aborted (code 1). > > I'm sure they're hitting a Fortran limit, but I need to prove it. I > haven't been able to find anything using The Google. > Everyone - thanks for all the help. I was inundated with suggestions, which I've passed along to the researcher I'm helping. I'm sure one of them will help us get the code to compile and run. Thanks again. 
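For anyone who finds this thread in the archives later, here is a rough, untested sketch of a stand-alone test case plus the build line that was suggested. The program and file names are made up, and only the array sizes come from the code quoted above, so treat it as an illustration rather than the actual fix:

! bigarray_repro.f90 -- hypothetical minimal reproducer, not the real code.
! Two static double-precision arrays of the sizes quoted earlier in the
! thread, roughly 2 GB in total, which is the case the small-vs-medium
! memory model discussion above is about.
program bigarray_repro
  implicit none
  double precision, save :: vstore(1:4,0:4,5000000,2)  ! 200,000,000 words, ~1.6 GB
  double precision, save :: fstore(0:4,5000000,2)      !  50,000,000 words, ~0.4 GB

  vstore = 0.0d0
  fstore = 0.0d0
  print *, 'vstore elements:', size(vstore)
  print *, 'fstore elements:', size(fstore)
end program bigarray_repro

! Build line per the ifort man page excerpt Gus posted:
!
!   ifort -mcmodel=medium -shared-intel bigarray_repro.f90

With the save attribute the arrays are static data, which is the case the -mcmodel discussion covers; without it, stack limits (the ulimit David mentioned) can bite first.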
-- Prentice From deadline at eadline.org Fri Nov 6 06:22:10 2009 From: deadline at eadline.org (Douglas Eadline) Date: Fri, 6 Nov 2009 09:22:10 -0500 (EST) Subject: [Beowulf] The 2009 Beowulf Bash In-Reply-To: <4AF31A63.7040803@ias.edu> References: <4AF063B3.3050508@ias.edu> <4AF31A63.7040803@ias.edu> Message-ID: <46622.192.168.1.213.1257517330.squirrel@mail.eadline.org> For those of you who have been waiting all year for this event, check out the following URL http://www.xandmarketing.com/beobash09/ Back Story: A certain local Portland company who must remain anonymous has donated 5 kegs of local beer they had brewed for SC09. I'm sure if you compile a list of possible Portland based HPC companies you will find it is a rather small group ;-) -- Doug From amjad11 at gmail.com Fri Nov 6 14:43:38 2009 From: amjad11 at gmail.com (amjad ali) Date: Fri, 6 Nov 2009 17:43:38 -0500 Subject: [Beowulf] Programming Help needed Message-ID: <428810f20911061443r456e04ccu81e80cd8eb51eab4@mail.gmail.com> Hi all, I need/request some help from those who have some experience in debugging/profiling/tuning parallel scientific codes, specially for PDEs/CFD. I have parallelized a Fortran CFD code to run on Ethernet-based-Linux-Cluster. Regarding MPI communication what I do is that: Suppose that the grid/mesh is decomposed for n number of processors, such that each processors has a number of elements that share their side/face with different processors. What I do is that I start non blocking MPI communication at the partition boundary faces (faces shared between any two processors) , and then start computing values on the internal/non-shared faces. When I complete this computation, I put WAITALL to ensure MPI communication completion. Then I do computation on the partition boundary faces (shared-ones). This way I try to hide the communication behind computation. Is it correct? IMPORTANT: Secondly, if processor A shares 50 faces (on 50 or less elements) with an another processor B then it sends/recvs 50 different messages. So in general if a processors has X number of faces sharing with any number of other processors it sends/recvs that much messages. Is this way has "very much reduced" performance in comparison to the possibility that processor A will send/recv a single-bundle message (containg all 50-faces-data) to process B. Means that in general a processor will only send/recv that much messages as the number of processors neighbour to it. It will send a single bundle/pack of messages to each neighbouring processor. Is their "quite a much difference" between these two approaches? THANK YOU VERY MUCH. AMJAD. -------------- next part -------------- An HTML attachment was scrubbed... URL: From stewart at serissa.com Sat Nov 7 03:11:19 2009 From: stewart at serissa.com (Larry Stewart) Date: Sat, 7 Nov 2009 06:11:19 -0500 Subject: [Beowulf] Programming Help needed In-Reply-To: <428810f20911061443r456e04ccu81e80cd8eb51eab4@mail.gmail.com> References: <428810f20911061443r456e04ccu81e80cd8eb51eab4@mail.gmail.com> Message-ID: <77b0285f0911070311y111b4bd4o1d8e49994bc017e2@mail.gmail.com> On Fri, Nov 6, 2009 at 5:43 PM, amjad ali wrote: > Hi all, > > > IMPORTANT: Secondly, if processor A shares 50 faces (on 50 or less > elements) with an another processor B then it sends/recvs 50 different > messages. So in general if a processors has X number of faces sharing with > any number of other processors it sends/recvs that much messages. 
Is this > way has "very much reduced" performance in comparison to the possibility > that processor A will send/recv a single-bundle message (containg all > 50-faces-data) to process B. Means that in general a processor will only > send/recv that much messages as the number of processors neighbour to it. > It will send a single bundle/pack of messages to each neighbouring > processor. > Is their "quite a much difference" between these two approaches? > > It is probably faster to send a single message with all the data, rather than fifty messages, especially if each item is small. However, you don't have to guess. Just create a small test program and use MPI_WTIME to measure how long the two cases take. The usual way to do timing measurements that gets decent results is to measure the time for one iteration of the two cases, then measure the time for two iterations, then 4, then 8, and so on until the time for a run exceeds one second. The issues that make it likely that one big message is faster than 50 small ones are that copying the data into a single message on a modern processor will be much faster than sending the bits over ethernet, and that each message has a certain overhead, which is probably large compared to the copy and transmission time of a small datum. If you will be writing MPI programs for various problems, it might be useful to download and run something like the Intel MPI Tests, that will give you performance figures for the various MPI operations and give you a feel for how expensive different things are on your system. -L -------------- next part -------------- An HTML attachment was scrubbed... URL: From stewart at serissa.com Sat Nov 7 03:22:04 2009 From: stewart at serissa.com (Larry Stewart) Date: Sat, 7 Nov 2009 06:22:04 -0500 Subject: [Beowulf] Programming Help needed In-Reply-To: <428810f20911061443r456e04ccu81e80cd8eb51eab4@mail.gmail.com> References: <428810f20911061443r456e04ccu81e80cd8eb51eab4@mail.gmail.com> Message-ID: <77b0285f0911070322o5c29c8bbk13f78178b2cef8a0@mail.gmail.com> On Fri, Nov 6, 2009 at 5:43 PM, amjad ali wrote: > Hi all, > > > Suppose that the grid/mesh is decomposed for n number of processors, such > that each processors has a number of elements that share their side/face > with different processors. What I do is that I start non blocking MPI > communication at the partition boundary faces (faces shared between any two > processors) , and then start computing values on the internal/non-shared > faces. When I complete this computation, I put WAITALL to ensure MPI > communication completion. Then I do computation on the partition boundary > faces (shared-ones). This way I try to hide the communication behind > computation. Is it correct? > > There are two issues here. First, correctness. The data for messages that arrive while you are computing may be written into memory asynchronously with respect to your program. Be sure that you are not depending on values in memory that may be overwritten by data arriving from other ranks. Second, overlap is good, but whether you actually get any overlap depends on the details. For example, the work of communicating with other ranks and sending messages and so forth must be done by something. For ethernet, there will be a lot of work done by the OS kernel and in general by some core on each node. If you expect to be using all the cores in a node to run your program, who is left to do the communications work? 
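To make the earlier timing suggestion concrete, below is a minimal two-rank sketch of the doubling-iteration MPI_WTIME comparison between many small messages and one packed message. The face count of 50 and the 8-double face size are placeholders, and the copy into the packed buffer is deliberately left inside the timed loop, since that copy is part of the cost of the bundled approach.

program msgtest
  use mpi
  implicit none
  integer, parameter :: nfaces = 50, facelen = 8        ! placeholder sizes
  integer :: rank, nprocs, ierr, nreps, i, f, req(nfaces)
  double precision :: faces(facelen,nfaces), packed(facelen*nfaces)
  double precision :: t0, t_many, t_one

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  if (nprocs /= 2) call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
  faces = 1.0d0

  nreps = 1
  do                                    ! double nreps until the run exceeds ~1 s
     call MPI_Barrier(MPI_COMM_WORLD, ierr)
     t0 = MPI_Wtime()
     do i = 1, nreps
        do f = 1, nfaces                ! 50 separate small messages
           if (rank == 0) then
              call MPI_Isend(faces(1,f), facelen, MPI_DOUBLE_PRECISION, 1, f, &
                             MPI_COMM_WORLD, req(f), ierr)
           else
              call MPI_Irecv(faces(1,f), facelen, MPI_DOUBLE_PRECISION, 0, f, &
                             MPI_COMM_WORLD, req(f), ierr)
           end if
        end do
        call MPI_Waitall(nfaces, req, MPI_STATUSES_IGNORE, ierr)
     end do
     t_many = MPI_Wtime() - t0
     call MPI_Bcast(t_many, 1, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierr)
     if (t_many > 1.0d0) exit           ! rank 0's time decides, so both ranks stop together
     nreps = nreps * 2
  end do

  call MPI_Barrier(MPI_COMM_WORLD, ierr)
  t0 = MPI_Wtime()
  do i = 1, nreps                       ! same number of reps, one packed message each
     if (rank == 0) then
        packed = reshape(faces, (/ facelen*nfaces /))   ! explicit packing cost
        call MPI_Send(packed, facelen*nfaces, MPI_DOUBLE_PRECISION, 1, 99, &
                      MPI_COMM_WORLD, ierr)
     else
        call MPI_Recv(packed, facelen*nfaces, MPI_DOUBLE_PRECISION, 0, 99, &
                      MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
     end if
  end do
  t_one = MPI_Wtime() - t0

  if (rank == 0) print *, nreps, ' reps:', t_many, ' s (50 msgs) vs', t_one, ' s (1 msg)'
  call MPI_Finalize(ierr)
end program msgtest

On GigE with payloads this small the per-message overhead usually dominates, so the packed version is normally the one to beat.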
Some implementations will timeshare the processors, giving the appearance of overlap, but not actually running faster, while other implementations simply won't do any work until the WAITALL that demands progress. If you have multicore nodes, and you don't need every last core to run your program, it can help if you only allocate some of the cores on each node to your program, leaving some "idle" to run the OS and the communications. The job control system should have a way to do this. You can test to find out if you are getting any overlap, by artificially reducing the actual communications work to near zero and seeing if the program runs any faster. -------------- next part -------------- An HTML attachment was scrubbed... URL: From john.hearns at mclaren.com Mon Nov 9 10:19:39 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Mon, 9 Nov 2009 18:19:39 -0000 Subject: [Beowulf] Kingston 40Gbyte SSD Message-ID: <68A57CCFD4005646957BD2D18E60667B0DC8CE0A@milexchmb1.mil.tagmclarengroup.com> Might make a nice cluster node drive: http://www.reghardware.co.uk/2009/11/09/review_storage_kingston_ssd_now_ v_40gb/ The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From joshua_mora at usa.net Fri Nov 6 15:29:58 2009 From: joshua_mora at usa.net (Joshua mora acosta) Date: Fri, 06 Nov 2009 17:29:58 -0600 Subject: [Beowulf] Programming Help needed Message-ID: <466NkFXC78668S06.1257550198@cmsweb06.cms.usa.net> Just try it and you'll understand what it means communication overhead.... most of these apps are network latency dominated: small messages but lots because of i) many neighbor processors involved and iterative process. Packing all the faces that need to be exchanges is the right way to go. You can also think in having a dedicated thread for handling the communications and the remaining ones for computation at the compute node level. So you really get good overlapping of computation and commputation. Joshua ------ Original Message ------ Received: 04:52 PM CST, 11/06/2009 From: amjad ali To: Beowulf Mailing List Subject: [Beowulf] Programming Help needed > Hi all, > > I need/request some help from those who have some experience in > debugging/profiling/tuning parallel scientific codes, specially for > PDEs/CFD. > > I have parallelized a Fortran CFD code to run on > Ethernet-based-Linux-Cluster. Regarding MPI communication what I do is that: > > Suppose that the grid/mesh is decomposed for n number of processors, such > that each processors has a number of elements that share their side/face > with different processors. What I do is that I start non blocking MPI > communication at the partition boundary faces (faces shared between any two > processors) , and then start computing values on the internal/non-shared > faces. When I complete this computation, I put WAITALL to ensure MPI > communication completion. Then I do computation on the partition boundary > faces (shared-ones). This way I try to hide the communication behind > computation. Is it correct? > > IMPORTANT: Secondly, if processor A shares 50 faces (on 50 or less elements) > with an another processor B then it sends/recvs 50 different messages. So in > general if a processors has X number of faces sharing with any number of > other processors it sends/recvs that much messages. 
Is this way has "very > much reduced" performance in comparison to the possibility that processor A > will send/recv a single-bundle message (containg all 50-faces-data) to > process B. Means that in general a processor will only send/recv that much > messages as the number of processors neighbour to it. It will send a single > bundle/pack of messages to each neighbouring processor. > Is their "quite a much difference" between these two approaches? > > THANK YOU VERY MUCH. > AMJAD. > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From amjad11 at gmail.com Tue Nov 10 10:21:08 2009 From: amjad11 at gmail.com (amjad ali) Date: Tue, 10 Nov 2009 13:21:08 -0500 Subject: [Beowulf] MPI Coding help needed Message-ID: <428810f20911101021x6421743dx10487632a648d141@mail.gmail.com> Hi all. (sorry for duplication, if it is) I have to parallelize a CFD code using domain/grid/mesh partitioning among the processes. Before running, we do not know, (i) How many processes we will use ( np is unknown) (ii) A process will have how many neighbouring processes (my_nbrs = ?) (iii) How many entries a process need to send to a particular neighbouring process. But when the code run, I calculate all of this info easily. The problem is to copy a number of entries to an array then send that array to a destination process. The same sender has to repeat this work to send data to all of its neighbouring processes. Is this following code fine: DO i = 1, my_nbrs DO j = 1, few_entries_for_this_neighbour send_array(j) = my_array(jth_particular_entry) ENDDO CALL MPI_ISEND(send_array(1:j),j, MPI_REAL8, dest(i), tag, MPI_COMM_WORLD, request1(i), ierr) ENDDO And the corresponding receives, at each process: DO i = 1, my_nbrs k = few_entries_from_this_neighbour CALL MPI_IRECV(recv_array(1:k),k, MPI_REAL8, source(i), tag, MPI_COMM_WORLD, request2(i), ierr) DO j = 1, few_from_source(i) received_data(j) = recv_array(j) ENDDO ENDDO After the above MPI_WAITALL. I think this code will not work. Both for sending and receiving. For the non-blocking sends we cannot use send_array to send data to other processes like above (as we are not sure for the availability of application buffer for reuse). Am I right? Similar problem is with recv array; data from multiple processes cannot be received in the same array like above. Am I right? Target is to hide communication behind computation. So need non blocking communication. As we do know value of np or values of my_nbrs for each process, we cannot decide to create so many arrays. Please suggest solution. =================== A more subtle solution that I could assume is following: cc = 0 DO i = 1, my_nbrs DO j = 1, few_entries_for_this_neighbour send_array(cc+j) = my_array(jth_particular_entry) ENDDO CALL MPI_ISEND(send_array(cc:cc+j),j, MPI_REAL8, dest(i), tag, MPI_COMM_WORLD, request1(i), ierr) cc = cc + j ENDDO And the corresponding receives, at each process: cc = 0 DO i = 1, my_nbrs k = few_entries_from_this_neighbour CALL MPI_IRECV(recv_array(cc+1:cc+k),k, MPI_REAL8, source(i), tag, MPI_COMM_WORLD, request2(i), ierr) DO j = 1, k received_data(j) = recv_array(cc+j) ENDDO cc = cc + k ENDDO After the above MPI_WAITALL. Means that, send_array for all neighbours will have a collected shape: send_array = [... entries for nbr 1 ..., ... entries for nbr 1 ..., ..., ... entries for last nbr ...] 
And the respective entries will be send to respective neighbours as above. recv_array for all neighbours will have a collected shape: recv_array = [... entries from nbr 1 ..., ... entries from nbr 1 ..., ..., ... entries from last nbr ...] And the entries from the processes will be received at respective locations/portion in the recv_array. Is this scheme is quite fine and correct. I am in search of efficient one. Request for help. With best regards, Amjad Ali. -------------- next part -------------- An HTML attachment was scrubbed... URL: From a28427 at ua.pt Tue Nov 10 10:31:41 2009 From: a28427 at ua.pt (Tiago Marques) Date: Tue, 10 Nov 2009 18:31:41 +0000 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: References: <4A81ACF3.60802@noaa.gov> Message-ID: Hi all, Sorry to ressurect this thread after all this time but I just figured out the problem with VASP by chance. VASP's INCAR file accepts one parameter that both fixes scalability problems and increases performance at the same time, even if you still stick to 6 cores. That parameter is NPAR. I was recommended to set NPAR=2 for most calculations and it worked great. Still, I experimented a bit and NPAR=1 and it gave even better results. It seems VASP, by default, is using NPAR=NCPUS, which cripples performance if you don't use multiples of 3. " running on 8 nodes distr: one band on 8 nodes, 1 groups" This is with NPAR=1 NPAR=2 gives something like: " running on 8 nodes distr: one band on 4 nodes, 2 groups" Enjoy the performance increase, if you haven't still. To us it increased around 33% in conjunction with running 8 CPUs. It seems to me that groups may be useful to run with more nodes and not just one machine but I haven't had the chance to test that out. On Tue, Aug 11, 2009 at 6:57 PM, Rahul Nabar wrote: > On Tue, Aug 11, 2009 at 12:40 PM, Craig Tierney wrote: >> What are you doing to ensure that you have both memory and processor >> affinity enabled? >> > > All I was using now was the flag: > > --mca mpi_paffinity_alone 1 I was actually using that on the Xeons 54xx, since the processors aren't native quad-cores, the kernel would keep threads bouncing from core to core to achive a proper load balance. This was the best it could do and I managed to get about 3% better performance from using that together with disabling some kernel option I don't quite remember right now, so the threads wouldn't jump around anymore. If you didn't disabled the load balancing the code would inevitably mis-schedule and the code would end up running with only 5 cores(or from start) and calculations would take around 10x longer. This was only useful in 6 cores per node, as then each processor would be running precisely 3 threads. With eight I haven't tried it but I assume the advantage would be negligible. Best regards, Tiago Marques > > Is there anything else I ought to be doing as well? > > -- > Rahul > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From spidus000 at gmail.com Mon Nov 9 19:57:08 2009 From: spidus000 at gmail.com (SpiduS Okami) Date: Tue, 10 Nov 2009 01:57:08 -0200 Subject: [Beowulf] Configuring HPCC Message-ID: Hello To All! This is my first post in beowulf list. I am looking for a help to configure my HPC Challenge test. I will be graduating in Computer science soon and,before that, I need to deliver a graduation work. 
I choosed to compare a "supercomputer" with a cluster of old machines, trying to proove that a cluster of old machines can cost less and process as much information as a supercomputer. For that I must maximize the capacity of the information processed by the supercomputer and the cluster. My configuration of "supercomputer" is: AMD Phenom 9600 Quad-Core 4 GB DDR2 800mhz RAM 500 GB HD - 40 GB for UBUNTU 9.10 HPCC 1.3.1 installed by deb package. Thak you! Att., Victor Bruno Alexander. From amjad11 at gmail.com Tue Nov 10 17:30:30 2009 From: amjad11 at gmail.com (amjad ali) Date: Tue, 10 Nov 2009 20:30:30 -0500 Subject: [Beowulf] array shape difference Message-ID: <428810f20911101730t45284baalcc0b5824091d8d7c@mail.gmail.com> HI, suppose we have four arrays with same number of elements say 60000., but different dimensions like: array1(1:60000) array2(1:2, 1:30000) array3(1:2, 1:300, 1:100) array4(1:4, 1:15, 1:10, 1:100) Does each of these arrays in fortran will occupy same amount of memory? For sending/receiving each of these, Does MPI has the same (or nearly same) overhead? or any significantly different overhead is involved in handling each of these arrays (by MPI)? with best regards, Amjad Ali -------------- next part -------------- An HTML attachment was scrubbed... URL: From stewart at serissa.com Tue Nov 10 17:43:13 2009 From: stewart at serissa.com (Larry Stewart) Date: Tue, 10 Nov 2009 20:43:13 -0500 Subject: [Beowulf] array shape difference In-Reply-To: <428810f20911101730t45284baalcc0b5824091d8d7c@mail.gmail.com> References: <428810f20911101730t45284baalcc0b5824091d8d7c@mail.gmail.com> Message-ID: <77b0285f0911101743x46cbd77bl147b63efc803042c@mail.gmail.com> On Tue, Nov 10, 2009 at 8:30 PM, amjad ali wrote: > HI, > > suppose we have four arrays with same number of elements say 60000., but > different dimensions like: > > array1(1:60000) > array2(1:2, 1:30000) > array3(1:2, 1:300, 1:100) > array4(1:4, 1:15, 1:10, 1:100) > > > Does each of these arrays in fortran will occupy same amount of memory? > > For sending/receiving each of these, Does MPI has the same (or nearly same) > overhead? or any significantly different overhead is involved in handling > each of these arrays (by MPI)? > > They should take the same amount of space and have nearly identical transfer times with MPI. (If you send the whole thing) -L -------------- next part -------------- An HTML attachment was scrubbed... URL: From Michael.Frese at NumerEx-LLC.com Wed Nov 11 06:01:58 2009 From: Michael.Frese at NumerEx-LLC.com (Michael H. Frese) Date: Wed, 11 Nov 2009 07:01:58 -0700 Subject: [Beowulf] array shape difference In-Reply-To: <77b0285f0911101743x46cbd77bl147b63efc803042c@mail.gmail.co m> References: <428810f20911101730t45284baalcc0b5824091d8d7c@mail.gmail.com> <77b0285f0911101743x46cbd77bl147b63efc803042c@mail.gmail.com> Message-ID: <6.2.5.6.2.20091111065622.0a466228@NumerEx-LLC.com> At 06:43 PM 11/10/2009, Larry Stewart wrote: >On Tue, Nov 10, 2009 at 8:30 PM, amjad ali ><amjad11 at gmail.com> wrote: >HI, > >suppose we have four arrays with same number of elements say 60000., >but different dimensions like: > >array1(1:60000) >array2(1:2, 1:30000) >array3(1:2, 1:300, 1:100) >array4(1:4, 1:15, 1:10, 1:100) > > >Does each of these arrays in fortran will occupy same amount of memory? > >For sending/receiving each of these, Does MPI has the same (or >nearly same) overhead? or any significantly different overhead is >involved in handling each of these arrays (by MPI)? 
> >They should take the same amount of space and have nearly identical >transfer times with MPI. >(If you send the whole thing) > >-L Those four array descriptors can all apply to exactly the same space, via an 'equivalence' statement. They are all laid out in memory just like array1. Thus, they can each be transmitted by exactly the same MPI send. Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: From amjad11 at gmail.com Wed Nov 11 09:04:25 2009 From: amjad11 at gmail.com (amjad ali) Date: Wed, 11 Nov 2009 12:04:25 -0500 Subject: [Beowulf] MPI Derived datatype + Persistent Message-ID: <428810f20911110904u5efcd310h215ab67ddcbf917a@mail.gmail.com> Hi all, I read that MPI Derived datatypes may provide efficient way to send data non-contiguous in the memory. MPI Persistent communication may provide efficient way in case some specified/fix communication is performed in an iterative code. Can we combine both together to get some enhanced benefit/efficiency? Better if any body can refer to some tutorial/example-code on this. Thank you for you attention. With best regards, Amjad Ali. -------------- next part -------------- An HTML attachment was scrubbed... URL: From peter.st.john at gmail.com Wed Nov 11 14:40:00 2009 From: peter.st.john at gmail.com (Peter St. John) Date: Wed, 11 Nov 2009 17:40:00 -0500 Subject: [Beowulf] array shape difference In-Reply-To: <6.2.5.6.2.20091111065622.0a466228@NumerEx-LLC.com> References: <428810f20911101730t45284baalcc0b5824091d8d7c@mail.gmail.com> <77b0285f0911101743x46cbd77bl147b63efc803042c@mail.gmail.com> <6.2.5.6.2.20091111065622.0a466228@NumerEx-LLC.com> Message-ID: The difference between: array1(1:60000) array2(1:2, 1:30000) would be reflected in the size of the executable, not the size of the data. Right? Peter On Wed, Nov 11, 2009 at 9:01 AM, Michael H. Frese < Michael.Frese at numerex-llc.com> wrote: > At 06:43 PM 11/10/2009, Larry Stewart wrote: > > > On Tue, Nov 10, 2009 at 8:30 PM, amjad ali wrote: > HI, > > suppose we have four arrays with same number of elements say 60000., but > different dimensions like: > > array1(1:60000) > array2(1:2, 1:30000) > array3(1:2, 1:300, 1:100) > array4(1:4, 1:15, 1:10, 1:100) > > > Does each of these arrays in fortran will occupy same amount of memory? > > For sending/receiving each of these, Does MPI has the same (or nearly same) > overhead? or any significantly different overhead is involved in handling > each of these arrays (by MPI)? > > They should take the same amount of space and have nearly identical > transfer times with MPI. > (If you send the whole thing) > > -L > > > Those four array descriptors can all apply to exactly the same space, via > an 'equivalence' statement. They are all laid out in memory just like > array1. > > Thus, they can each be transmitted by exactly the same MPI send. > > > Mike > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Michael.Frese at NumerEx-LLC.com Thu Nov 12 04:18:41 2009 From: Michael.Frese at NumerEx-LLC.com (Michael H. 
Frese) Date: Thu, 12 Nov 2009 05:18:41 -0700 Subject: [Beowulf] array shape difference In-Reply-To: References: <428810f20911101730t45284baalcc0b5824091d8d7c@mail.gmail.com> <77b0285f0911101743x46cbd77bl147b63efc803042c@mail.gmail.com> <6.2.5.6.2.20091111065622.0a466228@NumerEx-LLC.com> Message-ID: <6.2.5.6.2.20091112051712.067d6e40@NumerEx-LLC.com> That's correct. The executable size would reflect the extra operations required to compute the offset for the doubly dimensioned array. Mike At 03:40 PM 11/11/2009, Peter St. John wrote: >The difference between: >array1(1:60000) >array2(1:2, 1:30000) > >would be reflected in the size of the executable, not the size of the data. >Right? >Peter > > >On Wed, Nov 11, 2009 at 9:01 AM, Michael H. Frese ><Michael.Frese at numerex-llc.com> wrote: >At 06:43 PM 11/10/2009, Larry Stewart wrote: > > >>On Tue, Nov 10, 2009 at 8:30 PM, amjad ali >><amjad11 at gmail.com> wrote: >>HI, >>suppose we have four arrays with same number of elements say >>60000., but different dimensions like: >>array1(1:60000) >>array2(1:2, 1:30000) >>array3(1:2, 1:300, 1:100) >>array4(1:4, 1:15, 1:10, 1:100) >> >>Does each of these arrays in fortran will occupy same amount of memory? >>For sending/receiving each of these, Does MPI has the same (or >>nearly same) overhead? or any significantly different overhead is >>involved in handling each of these arrays (by MPI)? >> >>They should take the same amount of space and have nearly identical >>transfer times with MPI. >>(If you send the whole thing) >> >>-L > >Those four array descriptors can all apply to exactly the same >space, via an 'equivalence' statement. They are all laid out in >memory just like array1. > >Thus, they can each be transmitted by exactly the same MPI send. > > >Mike > >_______________________________________________ >Beowulf mailing list, >Beowulf at beowulf.org sponsored by Penguin Computing >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stuartb at 4gh.net Thu Nov 12 05:26:13 2009 From: stuartb at 4gh.net (Stuart Barkley) Date: Thu, 12 Nov 2009 08:26:13 -0500 (EST) Subject: [Beowulf] array shape difference In-Reply-To: <6.2.5.6.2.20091112051712.067d6e40@NumerEx-LLC.com> References: <428810f20911101730t45284baalcc0b5824091d8d7c@mail.gmail.com> <77b0285f0911101743x46cbd77bl147b63efc803042c@mail.gmail.com> <6.2.5.6.2.20091111065622.0a466228@NumerEx-LLC.com> <6.2.5.6.2.20091112051712.067d6e40@NumerEx-LLC.com> Message-ID: At 03:40 PM 11/11/2009, Peter St. John wrote: > The difference between: > array1(1:60000) > array2(1:2, 1:30000) > > would be reflected in the size of the executable, not the size > of the data. > Right? On Thu, 12 Nov 2009 at 07:18 -0000, Michael H. Frese wrote: > That's correct.? The executable size would reflect the extra operations > required to compute the offset for the doubly dimensioned array. Or maybe not. If the fortran code is doing virtual subscripts (e.g. array2(i*2 + j)) it would likely generate about the same code as the compiler would generate for 2 dimensions. In theory, the compiler can generate better subscript computation but I suspect in most reasonable (or simple testing) cases the actual code size difference is a wash. Go with what is most natural for expressing the algorithm. And ease the future maintenance. Stuart -- I've never been lost; I was once bewildered for three days, but never lost! 
-- Daniel Boone From Michael.Frese at NumerEx-LLC.com Thu Nov 12 06:50:31 2009 From: Michael.Frese at NumerEx-LLC.com (Michael H. Frese) Date: Thu, 12 Nov 2009 07:50:31 -0700 Subject: [Beowulf] array shape difference In-Reply-To: References: <428810f20911101730t45284baalcc0b5824091d8d7c@mail.gmail.com> <77b0285f0911101743x46cbd77bl147b63efc803042c@mail.gmail.com> <6.2.5.6.2.20091111065622.0a466228@NumerEx-LLC.com> <6.2.5.6.2.20091112051712.067d6e40@NumerEx-LLC.com> Message-ID: <6.2.5.6.2.20091112074734.0a4a2fe8@NumerEx-LLC.com> At 06:26 AM 11/12/2009, Stuart Barkley wrote: >At 03:40 PM 11/11/2009, Peter St. John wrote: > > The difference between: > > array1(1:60000) > > array2(1:2, 1:30000) > > > > would be reflected in the size of the executable, not the size > > of the data. > > Right? > >On Thu, 12 Nov 2009 at 07:18 -0000, Michael H. Frese wrote: > > > That's correct. The executable size would reflect the extra operations > > required to compute the offset for the doubly dimensioned array. > >Or maybe not. > > >If the fortran code is doing virtual subscripts (e.g. array2(i*2 + j)) >it would likely generate about the same code as the compiler would >generate for 2 dimensions. In theory, the compiler can generate >better subscript computation but I suspect in most reasonable (or >simple testing) cases the actual code size difference is a wash. > > >Go with what is most natural for expressing the algorithm. And ease >the future maintenance. > >Stuart Agreed. The code size differences would compiler dependent and minimal in any case. Human readability should determine the choice. Mike From amjad11 at gmail.com Thu Nov 12 07:02:38 2009 From: amjad11 at gmail.com (amjad ali) Date: Thu, 12 Nov 2009 09:02:38 -0600 Subject: [Beowulf] array shape difference In-Reply-To: <6.2.5.6.2.20091112074734.0a4a2fe8@NumerEx-LLC.com> References: <428810f20911101730t45284baalcc0b5824091d8d7c@mail.gmail.com> <77b0285f0911101743x46cbd77bl147b63efc803042c@mail.gmail.com> <6.2.5.6.2.20091111065622.0a466228@NumerEx-LLC.com> <6.2.5.6.2.20091112051712.067d6e40@NumerEx-LLC.com> <6.2.5.6.2.20091112074734.0a4a2fe8@NumerEx-LLC.com> Message-ID: <428810f20911120702n47b91b8cn3ea38b055e1b244e@mail.gmail.com> Hi all and Thanks you all. It is making quite good sense. On Thu, Nov 12, 2009 at 8:50 AM, Michael H. Frese < Michael.Frese at numerex-llc.com> wrote: > At 06:26 AM 11/12/2009, Stuart Barkley wrote: > >> At 03:40 PM 11/11/2009, Peter St. John wrote: >> > The difference between: >> > array1(1:60000) >> > array2(1:2, 1:30000) >> > >> > would be reflected in the size of the executable, not the size >> > of the data. >> > Right? >> >> On Thu, 12 Nov 2009 at 07:18 -0000, Michael H. Frese wrote: >> >> > That's correct. The executable size would reflect the extra operations >> > required to compute the offset for the doubly dimensioned array. >> >> Or maybe not. >> >> >> If the fortran code is doing virtual subscripts (e.g. array2(i*2 + j)) >> it would likely generate about the same code as the compiler would >> generate for 2 dimensions. In theory, the compiler can generate >> better subscript computation but I suspect in most reasonable (or >> simple testing) cases the actual code size difference is a wash. >> >> >> Go with what is most natural for expressing the algorithm. And ease >> the future maintenance. >> >> Stuart >> > > Agreed. The code size differences would compiler dependent and minimal in > any case. Human readability should determine the choice. 
> > > > Mike > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rpnabar at gmail.com Thu Nov 12 16:21:59 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 12 Nov 2009 18:21:59 -0600 Subject: [Beowulf] bizarre scaling behavior on a Nehalem In-Reply-To: References: <4A81ACF3.60802@noaa.gov> Message-ID: On Tue, Nov 10, 2009 at 12:31 PM, Tiago Marques wrote: > Hi all, > > Enjoy the performance increase, if you haven't still. To us it > increased around 33% in conjunction with running 8 CPUs. It seems to > me that groups may be useful to run with more nodes and not just one > machine but I haven't had the chance to test that out. THis is very interesting and promising Tiago. I still have not solved my VASP scaling woes. I am going to give your fix a shot now. -- Rahul From lindahl at pbm.com Thu Nov 12 16:46:31 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Thu, 12 Nov 2009 16:46:31 -0800 Subject: [Beowulf] array shape difference In-Reply-To: References: <428810f20911101730t45284baalcc0b5824091d8d7c@mail.gmail.com> <77b0285f0911101743x46cbd77bl147b63efc803042c@mail.gmail.com> <6.2.5.6.2.20091111065622.0a466228@NumerEx-LLC.com> <6.2.5.6.2.20091112051712.067d6e40@NumerEx-LLC.com> Message-ID: <20091113004631.GE11974@bx9.net> On Thu, Nov 12, 2009 at 08:26:13AM -0500, Stuart Barkley wrote: > > If the fortran code is doing virtual subscripts (e.g. array2(i*2 + j)) > it would likely generate about the same code as the compiler would > generate for 2 dimensions. In theory, the compiler can generate > better subscript computation but I suspect in most reasonable (or > simple testing) cases the actual code size difference is a wash. > Putting my "I used to work near a compiler group" hat on, I suspect a good compiler guy would tell you that they've worked hard to make sure both methods generate the same code for address computation. Strength reduction and the like are elementary optimizations these days. However, there is an issue that the compiler may have a better idea of the dimensions of the 2-dimensional array at compile time, leading to better optimization. That has nothing to do with the address computations, but everything to do with loop fusion, splitting, unrolling, pipelining, SIMDizing, cache effects, etc. -- greg From rpnabar at gmail.com Thu Nov 12 16:47:42 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 12 Nov 2009 18:47:42 -0600 Subject: [Beowulf] UEFI (Unified Extensible Firmware Interface) for the BIOS Message-ID: Has anyone tried out UEFI (Unified Extensible Firmware Interface) in the BIOS? The new servers I am buying come with this option in the BIOS. Out of curiosity I googled it up. I am not sure if there were any HPC implications of this and wanted to double check before I switched to this from my conventional plain-vanilla BIOS. Any sort of "industry standard" always sounds good but I thought it safer to check on the group first.... Any advice or pitfalls? -- Rahul From christiansuhendra at gmail.com Thu Nov 12 04:33:22 2009 From: christiansuhendra at gmail.com (christian suhendra) Date: Thu, 12 Nov 2009 04:33:22 -0800 Subject: [Beowulf] ask about mpich Message-ID: halo guys i wants to make a cluster system with mpich in ubuntu,,but i have troubleshooting with mpich.. 
but when i run the example program in mpich..it doesn't work in cluster..but i've registered the node on machine.LINUX.. but still not working please help me..this is my thesis... -------------- next part -------------- An HTML attachment was scrubbed... URL: From lm.moreira at gmail.com Thu Nov 12 14:01:17 2009 From: lm.moreira at gmail.com (Leonardo Machado Moreira) Date: Thu, 12 Nov 2009 20:01:17 -0200 Subject: [Beowulf] Cluster of Linux and Windows Message-ID: <4788ffe70911121401mcd5deaj52b9036197b4f619@mail.gmail.com> Hi, I am new on Clusters and have some doubts about them. I am used to work with Arch Linux. What do you think about it? And finnaly, I would like to know if Is it possible to get a Cluster Working with a Server on Arch Linux and the nodes Windows. Or even better the nodes without a defined SO. What do you think?, Does it worth? Thanks in advance. Leonardo Machado Moreira. -------------- next part -------------- An HTML attachment was scrubbed... URL: From john.hearns at mclaren.com Fri Nov 13 01:19:35 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Fri, 13 Nov 2009 09:19:35 -0000 Subject: [Beowulf] UEFI (Unified Extensible Firmware Interface) for the BIOS In-Reply-To: References: Message-ID: <68A57CCFD4005646957BD2D18E60667B0E3C2942@milexchmb1.mil.tagmclarengroup.com> Has anyone tried out UEFI (Unified Extensible Firmware Interface) in the BIOS? The new servers I am buying come with this option in the BIOS. Not specifically UEFI, but EFI is the standard on Itanium systems, so has been in use for a long time. I use it every day. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From coutinho at dcc.ufmg.br Fri Nov 13 05:20:49 2009 From: coutinho at dcc.ufmg.br (Bruno Coutinho) Date: Fri, 13 Nov 2009 11:20:49 -0200 Subject: [Beowulf] Cluster of Linux and Windows In-Reply-To: <4788ffe70911121401mcd5deaj52b9036197b4f619@mail.gmail.com> References: <4788ffe70911121401mcd5deaj52b9036197b4f619@mail.gmail.com> Message-ID: 2009/11/12 Leonardo Machado Moreira > Hi, I am new on Clusters and have some doubts about them. > > I am used to work with Arch Linux. What do you think about it? > > And finnaly, I would like to know if Is it possible to get a Cluster > Working with a Server on Arch Linux and the nodes Windows. > > Or even better the nodes without a defined SO. > You should have same MPI implementation on all machines (despite windows adds some network overhead) and choose a method to launch process in slave machines (ssh , mpd, etc). > > What do you think?, Does it worth? > > Thanks in advance. > > Leonardo Machado Moreira. > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hahn at mcmaster.ca Fri Nov 13 09:29:35 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 13 Nov 2009 12:29:35 -0500 (EST) Subject: [Beowulf] Cluster of Linux and Windows In-Reply-To: <4788ffe70911121401mcd5deaj52b9036197b4f619@mail.gmail.com> References: <4788ffe70911121401mcd5deaj52b9036197b4f619@mail.gmail.com> Message-ID: > I am used to work with Arch Linux. What do you think about it? 
the distro is basically irrelevant. clustering is just a matter of your apps, middleware like mpi (may or may not be provided by the cluster), probably a shared filesystem, working kernel, network stack, job-launch mechanism. distros are mainly about desktop gunk that is completely irrelevant to clusters. > And finnaly, I would like to know if Is it possible to get a Cluster Working > with a Server on Arch Linux and the nodes Windows. sure, but why? windows is generally inferior as an OS platform, so I would stay away unless you actually require your apps to run under windows. (remember that linux can use windows storage and authentication just fine.) > Or even better the nodes without a defined SO. SO=Significant Other? oh, maybe "OS". generally, you want to minimize the number of things that can go wrong in your system. using uniform OS on nodes/servers is a good start. but sure, there's no reason you can't run a cluster where every node is a different OS. they simply need to agree on the network protocol (which doesn't have to be MPI - in fact, using something more SOA-like might help if the nodes are heterogenous) From amjad11 at gmail.com Sat Nov 14 06:47:07 2009 From: amjad11 at gmail.com (amjad ali) Date: Sat, 14 Nov 2009 09:47:07 -0500 Subject: [Beowulf] Array Declaration approach difference Message-ID: <428810f20911140647g3af84a6dg78a2ad9d7399ad3e@mail.gmail.com> Hi All. I have parallel PDE/CFD code in fortran. Let we consider it consisting of two parts: 1) Startup part; that includes input reads, splits, distributions, forming neighborhood information arrays, grid arrays, and all related. It includes most of the necessary array declarations. 2) Iterative part; we proceed the solution in time. Approach One: ============ What I do is that during the Startup phase, I declare the most array allocatable and then allocate them sizes depending upon the input reads and domain partitioning. And then In the iterative phase I utilize those arrays. But I "do not" allocate/deallocate new arrays in the iterative part. Approach Two: ============ I think that, what if I first use to run only the start -up phase of my parallel code having allocatable like things and get the sizes-values required for array allocations for a specific problem size and partitioning. Then I use these values as contant in another version of my code in which I will declare array with the contant values obtained. So my question is that will there be any significant performance/efficiency diffrence in the "ITERATIVE part" if the approch two is used (having arrays declared fixed sizes/values)? Thank You for your kind attention. with best regards, Amjad Ali. -------------- next part -------------- An HTML attachment was scrubbed... URL: From siegert at sfu.ca Sat Nov 14 16:43:27 2009 From: siegert at sfu.ca (Martin Siegert) Date: Sat, 14 Nov 2009 16:43:27 -0800 Subject: [Beowulf] MPI_Isend/Irecv failure for IB and large message sizes Message-ID: <20091115004327.GA12781@stikine.its.sfu.ca> Hi, I am running into problems when sending large messages (about 180000000 doubles) over IB. A fairly trivial example program is attached. # mpicc -g sendrecv.c # mpiexec -machinefile m2 -n 2 ./a.out id=1: calling irecv ... id=0: calling isend ... [[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id 199132400 opcode 549755813 vendor error 105 qp_idx 3 This is with OpenMPI-1.3.3. Does anybody know a solution to this problem? 
If I use MPI_Allreduce instead of MPI_Isend/Irecv, the program just hangs and never returns. I asked on the openmpi users list but got no response ... Cheers, Martin -- Martin Siegert Head, Research Computing WestGrid Site Lead IT Services phone: 778 782-4691 Simon Fraser University fax: 778 782-4242 Burnaby, British Columbia email: siegert at sfu.ca Canada V5A 1S6 -------------- next part -------------- A non-text attachment was scrubbed... Name: sendrecv.c Type: text/x-c++src Size: 1054 bytes Desc: not available URL: From hahn at mcmaster.ca Sun Nov 15 12:38:08 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Sun, 15 Nov 2009 15:38:08 -0500 (EST) Subject: [Beowulf] MPI_Isend/Irecv failure for IB and large message sizes In-Reply-To: <20091115004327.GA12781@stikine.its.sfu.ca> References: <20091115004327.GA12781@stikine.its.sfu.ca> Message-ID: > I am running into problems when sending large messages (about > 180000000 doubles) over IB. A fairly trivial example program is attached. sorry if you've already thought of this, but might you have RLIMIT_MEMLOCK set too low? (ulimit -l) > [[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id 199132400 opcode 549755813 vendor error 105 qp_idx 3 105 looks like it might be an errno to me: #define ENOBUFS 105 /* No buffer space available */ regards, mark. From mdidomenico4 at gmail.com Sun Nov 15 14:29:13 2009 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Sun, 15 Nov 2009 14:29:13 -0800 Subject: [Beowulf] MPI_Isend/Irecv failure for IB and large message sizes In-Reply-To: <20091115004327.GA12781@stikine.its.sfu.ca> References: <20091115004327.GA12781@stikine.its.sfu.ca> Message-ID: you might want to ask on the linux-rdma list (was openfabrics). its been awhile since i looked at IB error messages, but what stack/version are you running? On Sat, Nov 14, 2009 at 4:43 PM, Martin Siegert wrote: > Hi, > > I am running into problems when sending large messages (about > 180000000 doubles) over IB. A fairly trivial example program is attached. > > # mpicc -g sendrecv.c > # mpiexec -machinefile m2 -n 2 ./a.out > id=1: calling irecv ... > id=0: calling isend ... > [[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id 199132400 opcode 549755813 ?vendor error 105 qp_idx 3 > > This is with OpenMPI-1.3.3. > Does anybody know a solution to this problem? > > If I use MPI_Allreduce instead of MPI_Isend/Irecv, the program just hangs > and never returns. > I asked on the openmpi users list but got no response ... > > Cheers, > Martin > > -- > Martin Siegert > Head, Research Computing > WestGrid Site Lead > IT Services ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?phone: 778 782-4691 > Simon Fraser University ? ? ? ? ? ? ? ? ? ?fax: ? 778 782-4242 > Burnaby, British Columbia ? ? ? ? ? ? ? ? 
?email: siegert at sfu.ca > Canada ?V5A 1S6 > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > From lm.moreira at gmail.com Fri Nov 13 05:34:13 2009 From: lm.moreira at gmail.com (Leonardo Machado Moreira) Date: Fri, 13 Nov 2009 11:34:13 -0200 Subject: [Beowulf] Cluster of Linux and Windows In-Reply-To: References: <4788ffe70911121401mcd5deaj52b9036197b4f619@mail.gmail.com> Message-ID: <4788ffe70911130534x1ffc7bbdna3ce1adddf54c7c@mail.gmail.com> Basicaly, Is a Cluster Implementation just based on these two libraries MPI on the Server and SSH on the clients?? And a program on tcl/tk for example on server to watch the cluster? Thanks a lot. Leonardo Machado Moreira On Fri, Nov 13, 2009 at 11:20 AM, Bruno Coutinho wrote: > > > 2009/11/12 Leonardo Machado Moreira > >> Hi, I am new on Clusters and have some doubts about them. >> >> I am used to work with Arch Linux. What do you think about it? >> >> And finnaly, I would like to know if Is it possible to get a Cluster >> Working with a Server on Arch Linux and the nodes Windows. >> >> Or even better the nodes without a defined SO. >> > > You should have same MPI implementation on all machines (despite windows > adds some network overhead) and choose a method to launch process in slave > machines (ssh , mpd, etc). > > >> >> What do you think?, Does it worth? >> >> Thanks in advance. >> >> Leonardo Machado Moreira. >> >> >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From zenabdin1988 at hotmail.com Sat Nov 14 04:24:16 2009 From: zenabdin1988 at hotmail.com (Zain elabedin hammade) Date: Sat, 14 Nov 2009 12:24:16 +0000 Subject: [Beowulf] mpd ..failed ..! Message-ID: Hello All. I have a cluster with 4 machines (fedora core 11). I installed mpich2 - 1.1.1-1.fc11.i586.rpm . I wrote on every machine : mpd & mpdtrace -l then i wrote on thr Master : mpd -h Worker1.cluster.net - p 56128 -n I got : Master.cluster.net_38047 (connect_lhs 944): NOT OK to enter ring; one likely cause: mismatched secretwords Master.cluster.net_38047 (enter_ring 873): lhs connect failed Master.cluster.net_38047 (run 256): failed to enter ring And the same was for other machines : Worker2 and Worker3 . For information : I have SSH works on .. So where is the problem ? What i have to do ? I really need your help . Regarded . _________________________________________________________________ Windows Live Hotmail: Your friends can get your Facebook updates, right from Hotmail?. http://www.microsoft.com/middleeast/windows/windowslive/see-it-in-action/social-network-basics.aspx?ocid=PID23461::T:WLMTAGL:ON:WL:en-xm:SI_SB_4:092009 -------------- next part -------------- An HTML attachment was scrubbed... 
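The "mismatched secretwords" message usually means the machines in the ring are not reading identical ~/.mpd.conf files. A minimal sketch of the usual MPD setup follows; the secret word and the host file name are made up, and the first two steps have to be repeated on every machine unless home directories are shared.

  # same file contents on Master and all Workers, readable only by you
  echo "secretword=some-long-private-string" > ~/.mpd.conf
  chmod 600 ~/.mpd.conf

  # then, from the master only, start the whole ring in one step
  # (hosts.txt lists Worker1.cluster.net, Worker2.cluster.net, Worker3.cluster.net)
  mpdboot -n 4 -f hosts.txt
  mpdtrace -l      # should now list all four machines
  mpdallexit       # shuts the ring down again

Running "mpd &" by hand on each machine only creates four independent one-host rings; mpdboot with a shared .mpd.conf avoids both that and the secretword mismatch.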
URL: From becker at scyld.com Mon Nov 16 01:40:55 2009 From: becker at scyld.com (Donald Becker) Date: Mon, 16 Nov 2009 01:40:55 -0800 (PST) Subject: [Beowulf] Final Announcement: 11th Annual Beowulf Bash 9pm Nov 16 2009 Message-ID: Final Announcement: 11th Annual Beowulf Bash 9pm Nov 16 2009 11th Annual Beowulf Bash And LECCIBG 9pm November 16 2009 The Game, at the Rose Quarter http://www.xandmarketing.com/beobash09/ It will take place, as usual, with the IEEE SC Conference. Continuing with recent tradition, we holding the Beowulf Bash Monday evening just after the Opening Gala. As in previous years, the primary attraction is the conversations with other attendees. We will supplement this with musical entertainment. We will have drinks and snacks, along with a few give-aways There will be a short greeting by the sponsors about 10pm Try to be there by then. Again: Monday, November 16 2009 9-11pm (Immediately after the SC09 Opening Gala) The Game, at the Rose Quarter (Close to the Convention Center) -- Donald Becker becker at scyld.com Penguin Computing / Scyld Software www.penguincomputing.com www.scyld.com Annapolis MD and San Francisco CA From atchley at myri.com Mon Nov 16 05:04:06 2009 From: atchley at myri.com (Scott Atchley) Date: Mon, 16 Nov 2009 08:04:06 -0500 Subject: [Beowulf] mpd ..failed ..! In-Reply-To: References: Message-ID: <9ABA6320-A238-49E7-B71E-C1D4D6D05391@myri.com> On Nov 14, 2009, at 7:24 AM, Zain elabedin hammade wrote: > I installed mpich2 - 1.1.1-1.fc11.i586.rpm . You should ask this on the mpich list at: https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss > I wrote on every machine : > > mpd & > mpdtrace -l You started stand-alone MPD rings of size one on each host. This is incorrect. You should use mpdboot and a machine file. $ mpdboot -f machinefile -n ... Scott From Michael.Frese at NumerEx-LLC.com Mon Nov 16 09:49:23 2009 From: Michael.Frese at NumerEx-LLC.com (Michael H. Frese) Date: Mon, 16 Nov 2009 10:49:23 -0700 Subject: [Beowulf] MPI_Isend/Irecv failure for IB and large message sizes In-Reply-To: <20091115004327.GA12781@stikine.its.sfu.ca> References: <20091115004327.GA12781@stikine.its.sfu.ca> Message-ID: <6.2.5.6.2.20091116104423.0679eb60@NumerEx-LLC.com> Martin, Could it be that your MPI library was compiled using a small memory model? The 180 million doubles sounds suspiciously close to a 2 GB addressing limit. This issue came up on the list recently under the topic "Fortran Array size question." Mike At 05:43 PM 11/14/2009, Martin Siegert wrote: >Hi, > >I am running into problems when sending large messages (about >180000000 doubles) over IB. A fairly trivial example program is attached. > ># mpicc -g sendrecv.c ># mpiexec -machinefile m2 -n 2 ./a.out >id=1: calling irecv ... >id=0: calling isend ... >[[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 >error polling LP CQ with status LOCAL LENGTH ERROR status number 1 >for wr_id 199132400 opcode 549755813 vendor error 105 qp_idx 3 > >This is with OpenMPI-1.3.3. >Does anybody know a solution to this problem? > >If I use MPI_Allreduce instead of MPI_Isend/Irecv, the program just hangs >and never returns. >I asked on the openmpi users list but got no response ... 
> >Cheers, >Martin > >-- >Martin Siegert >Head, Research Computing >WestGrid Site Lead >IT Services phone: 778 782-4691 >Simon Fraser University fax: 778 782-4242 >Burnaby, British Columbia email: siegert at sfu.ca >Canada V5A 1S6 > > >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf From siegert at sfu.ca Mon Nov 16 12:56:21 2009 From: siegert at sfu.ca (Martin Siegert) Date: Mon, 16 Nov 2009 12:56:21 -0800 Subject: [Beowulf] MPI_Isend/Irecv failure for IB and large message sizes In-Reply-To: <6.2.5.6.2.20091116104423.0679eb60@NumerEx-LLC.com> References: <20091115004327.GA12781@stikine.its.sfu.ca> <6.2.5.6.2.20091116104423.0679eb60@NumerEx-LLC.com> Message-ID: <20091116205621.GB21826@stikine.its.sfu.ca> Hi Michael, On Mon, Nov 16, 2009 at 10:49:23AM -0700, Michael H. Frese wrote: > Martin, > > Could it be that your MPI library was compiled using a small memory model? > The 180 million doubles sounds suspiciously close to a 2 GB addressing > limit. > > This issue came up on the list recently under the topic "Fortran Array size > question." > > > Mike I am running MPI applications that use more than 16GB of memory - I do not believe that this is the problem. Also -mmodel=large does not appear to be a valid argument for gcc under x86_64: gcc -DNDEBUG -g -fPIC -mmodel=large conftest.c >&5 cc1: error: unrecognized command line option "-mmodel=large" - Martin > At 05:43 PM 11/14/2009, Martin Siegert wrote: >> Hi, >> >> I am running into problems when sending large messages (about >> 180000000 doubles) over IB. A fairly trivial example program is attached. >> >> # mpicc -g sendrecv.c >> # mpiexec -machinefile m2 -n 2 ./a.out >> id=1: calling irecv ... >> id=0: calling isend ... >> [[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error >> polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id >> 199132400 opcode 549755813 vendor error 105 qp_idx 3 >> >> This is with OpenMPI-1.3.3. >> Does anybody know a solution to this problem? >> >> If I use MPI_Allreduce instead of MPI_Isend/Irecv, the program just hangs >> and never returns. >> I asked on the openmpi users list but got no response ... >> >> Cheers, >> Martin >> >> -- >> Martin Siegert >> Head, Research Computing >> WestGrid Site Lead >> IT Services phone: 778 782-4691 >> Simon Fraser University fax: 778 782-4242 >> Burnaby, British Columbia email: siegert at sfu.ca >> Canada V5A 1S6 From siegert at sfu.ca Mon Nov 16 13:01:02 2009 From: siegert at sfu.ca (Martin Siegert) Date: Mon, 16 Nov 2009 13:01:02 -0800 Subject: [Beowulf] MPI_Isend/Irecv failure for IB and large message sizes In-Reply-To: References: <20091115004327.GA12781@stikine.its.sfu.ca> Message-ID: <20091116210102.GC21826@stikine.its.sfu.ca> On Sun, Nov 15, 2009 at 02:29:13PM -0800, Michael Di Domenico wrote: > you might want to ask on the linux-rdma list (was openfabrics). its > been awhile since i looked at IB error messages, but what > stack/version are you running? This is under Scientific Linux 5.3 which is a RH 5.3 clone that comes with OFED-1.3.2, which admittedly is quite old. Unfortunately, upgrading this is a major forklift ... thus I must be sure that this is really the problem. I'll do a few tests on a couple of nodes ... Thanks! 
- Martin > On Sat, Nov 14, 2009 at 4:43 PM, Martin Siegert wrote: > > Hi, > > > > I am running into problems when sending large messages (about > > 180000000 doubles) over IB. A fairly trivial example program is attached. > > > > # mpicc -g sendrecv.c > > # mpiexec -machinefile m2 -n 2 ./a.out > > id=1: calling irecv ... > > id=0: calling isend ... > > [[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id 199132400 opcode 549755813 ?vendor error 105 qp_idx 3 > > > > This is with OpenMPI-1.3.3. > > Does anybody know a solution to this problem? > > > > If I use MPI_Allreduce instead of MPI_Isend/Irecv, the program just hangs > > and never returns. > > I asked on the openmpi users list but got no response ... > > > > Cheers, > > Martin > > > > -- > > Martin Siegert > > Head, Research Computing > > WestGrid Site Lead > > IT Services ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?phone: 778 782-4691 > > Simon Fraser University ? ? ? ? ? ? ? ? ? ?fax: ? 778 782-4242 > > Burnaby, British Columbia ? ? ? ? ? ? ? ? ?email: siegert at sfu.ca > > Canada ?V5A 1S6 From siegert at sfu.ca Mon Nov 16 13:24:50 2009 From: siegert at sfu.ca (Martin Siegert) Date: Mon, 16 Nov 2009 13:24:50 -0800 Subject: [Beowulf] MPI_Isend/Irecv failure for IB and large message sizes In-Reply-To: References: <20091115004327.GA12781@stikine.its.sfu.ca> Message-ID: <20091116212450.GD21826@stikine.its.sfu.ca> Hi Mark, On Sun, Nov 15, 2009 at 03:38:08PM -0500, Mark Hahn wrote: >> I am running into problems when sending large messages (about >> 180000000 doubles) over IB. A fairly trivial example program is attached. > > sorry if you've already thought of this, but might you have RLIMIT_MEMLOCK > set too low? (ulimit -l) Good point. By now I have played with all kinds of ulimits (the nodes have 16GB of memory and 16GB of swap space - this program is not even coming close to those limits). This is the current setting: # ulimit -a core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 139264 max locked memory (kbytes, -l) unlimited max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) unlimited real-time priority (-r) 0 stack size (kbytes, -s) unlimited cpu time (seconds, -t) unlimited max user processes (-u) 139264 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited ... same error :-( >> [[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id 199132400 opcode 549755813 vendor error 105 qp_idx 3 > > 105 looks like it might be an errno to me: > #define ENOBUFS 105 /* No buffer space available */ > > regards, mark. BTW: when using Intel-MPI (MPICH2) the program segfaults with l = 26843546 = 2^31/8 which makes me suspect that they use MPI_Byte to transfer the data internally and multiply the variable count by 8 without checking whether the integer overflows ... 
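Until the real limit is pinned down, the usual workaround for the 180-million-double case is to keep any single message well below 2**31 bytes by sending it in slices. A minimal sketch follows (blocking calls for brevity; the 1 GiB chunk size and the routine names are arbitrary); the same slicing works with MPI_Isend/MPI_Irecv plus an array of requests.

subroutine send_big(buf, n, dest, comm)
  use mpi
  implicit none
  integer, intent(in) :: n, dest, comm
  double precision, intent(in) :: buf(n)
  integer, parameter :: chunk = 2**27          ! 2**27 doubles = 1 GiB per message
  integer :: first, m, tag, ierr
  first = 1
  tag = 0
  do while (first <= n)
     m = min(chunk, n - first + 1)             ! elements in this piece
     call MPI_Send(buf(first), m, MPI_DOUBLE_PRECISION, dest, tag, comm, ierr)
     first = first + m
     tag = tag + 1
  end do
end subroutine send_big

subroutine recv_big(buf, n, src, comm)
  use mpi
  implicit none
  integer, intent(in) :: n, src, comm
  double precision, intent(out) :: buf(n)
  integer, parameter :: chunk = 2**27
  integer :: first, m, tag, ierr
  first = 1
  tag = 0
  do while (first <= n)
     m = min(chunk, n - first + 1)
     call MPI_Recv(buf(first), m, MPI_DOUBLE_PRECISION, src, tag, comm, &
                   MPI_STATUS_IGNORE, ierr)
     first = first + m
     tag = tag + 1
  end do
end subroutine recv_big

With 180,000,000 doubles this produces two messages, one of 2**27 doubles and one of the remainder, both below the 2 GB mark.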
- Martin From gus at ldeo.columbia.edu Mon Nov 16 13:55:51 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Mon, 16 Nov 2009 16:55:51 -0500 Subject: [Beowulf] MPI_Isend/Irecv failure for IB and large message sizes In-Reply-To: <20091116205621.GB21826@stikine.its.sfu.ca> References: <20091115004327.GA12781@stikine.its.sfu.ca> <6.2.5.6.2.20091116104423.0679eb60@NumerEx-LLC.com> <20091116205621.GB21826@stikine.its.sfu.ca> Message-ID: <4B01CA67.8020303@ldeo.columbia.edu> Hi Martin We didn't know which compiler you used. So what Michael sent you ("mmodel=memory_model") is the Intel compiler flag syntax. (PGI uses the same syntax, IIRR.) Gcc/gfortran use "-mcmodel=memory_model" for x86_64 architecture. I only used this with Intel ifort, hence I am not sure, but "medium" should work fine for large data/not-so-large program in gcc/gfortran. The "large" model doesn't seem to be implemented by gcc (4.1.2) anyway. (Maybe it is there in newer gcc versions.) The darn thing is that gcc says "medium" doesn't support building shared libraries, hence you may need to build OpenMPI static libraries instead, I would guess. (Again, check this if you have a newer gcc version.) Here's an excerpt of my gcc (4.1.2) man page: -mcmodel=small Generate code for the small code model: the program and its symbols must be linked in the lower 2 GB of the address space. Pointers are 64 bits. Pro- grams can be statically or dynamically linked. This is the default code model. -mcmodel=kernel Generate code for the kernel code model. The kernel runs in the negative 2 GB of the address space. This model has to be used for Linux kernel code. -mcmodel=medium Generate code for the medium model: The program is linked in the lower 2 GB of the address space but symbols can be located anywhere in the address space. Programs can be statically or dynamically linked, but building of shared libraries are not supported with the medium model. -mcmodel=large Generate code for the large model: This model makes no assumptions about addresses and sizes of sections. Currently GCC does not implement this model. If you are using OpenMPI, "ompi-info -config" will tell the flags used to compile it. Mine is 1.3.2 and has no explicit mcmodel flag, which according to the gcc man page should default to "small". Are you using 16GB per process or for the whole set of processes? I hope this helps, Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- Martin Siegert wrote: > Hi Michael, > > On Mon, Nov 16, 2009 at 10:49:23AM -0700, Michael H. Frese wrote: >> Martin, >> >> Could it be that your MPI library was compiled using a small memory model? >> The 180 million doubles sounds suspiciously close to a 2 GB addressing >> limit. >> >> This issue came up on the list recently under the topic "Fortran Array size >> question." >> >> >> Mike > > I am running MPI applications that use more than 16GB of memory - > I do not believe that this is the problem. Also -mmodel=large > does not appear to be a valid argument for gcc under x86_64: > gcc -DNDEBUG -g -fPIC -mmodel=large conftest.c >&5 > cc1: error: unrecognized command line option "-mmodel=large" > > - Martin > >> At 05:43 PM 11/14/2009, Martin Siegert wrote: >>> Hi, >>> >>> I am running into problems when sending large messages (about >>> 180000000 doubles) over IB. 
A fairly trivial example program is attached. >>> >>> # mpicc -g sendrecv.c >>> # mpiexec -machinefile m2 -n 2 ./a.out >>> id=1: calling irecv ... >>> id=0: calling isend ... >>> [[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error >>> polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id >>> 199132400 opcode 549755813 vendor error 105 qp_idx 3 >>> >>> This is with OpenMPI-1.3.3. >>> Does anybody know a solution to this problem? >>> >>> If I use MPI_Allreduce instead of MPI_Isend/Irecv, the program just hangs >>> and never returns. >>> I asked on the openmpi users list but got no response ... >>> >>> Cheers, >>> Martin >>> >>> -- >>> Martin Siegert >>> Head, Research Computing >>> WestGrid Site Lead >>> IT Services phone: 778 782-4691 >>> Simon Fraser University fax: 778 782-4242 >>> Burnaby, British Columbia email: siegert at sfu.ca >>> Canada V5A 1S6 > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From hahn at mcmaster.ca Mon Nov 16 13:58:25 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon, 16 Nov 2009 16:58:25 -0500 (EST) Subject: [Beowulf] MPI_Isend/Irecv failure for IB and large message sizes In-Reply-To: <20091116212450.GD21826@stikine.its.sfu.ca> References: <20091115004327.GA12781@stikine.its.sfu.ca> <20091116212450.GD21826@stikine.its.sfu.ca> Message-ID: >>> I am running into problems when sending large messages (about >>> 180000000 doubles) over IB. A fairly trivial example program is attached. >> >> sorry if you've already thought of this, but might you have RLIMIT_MEMLOCK >> set too low? (ulimit -l) > > Good point. ... > max locked memory (kbytes, -l) unlimited ... > ... same error :-( well, at this point, I'd consider running the test program under strace. From djholm at fnal.gov Mon Nov 16 14:24:27 2009 From: djholm at fnal.gov (Don Holmgren) Date: Mon, 16 Nov 2009 16:24:27 -0600 (CST) Subject: [Beowulf] MPI_Isend/Irecv failure for IB and large message sizes In-Reply-To: <20091116212450.GD21826@stikine.its.sfu.ca> References: <20091115004327.GA12781@stikine.its.sfu.ca> <20091116212450.GD21826@stikine.its.sfu.ca> Message-ID: Be careful - ulimits can differ between an interactive shell launched with rsh/ssh, an interactive batch shell launched with "qsub -I" and the like, the environment of your batch script, and the environment of the processes launched via mpirun. I've been burned by this before. If you are using a TM-based launch, for example (openmpi or OSU mpiexec), the ulimit environment on a PBS/Torque batch setup will be governed by the ulimits of pbs_mom, which in turn is governed by your init process and/or by any of the ulimit commands in init.d/pbs-client. The only way to be sure of a particular ulimit is to do a "get_rlimits()" call in your mpi-launched binary and check the size. Chances are this isn't your problem, though, because usually the error messages make it pretty clear that a memory lock failure has occurred. Don Holmgren Fermilab On Mon, 16 Nov 2009, Martin Siegert wrote: > Hi Mark, > > On Sun, Nov 15, 2009 at 03:38:08PM -0500, Mark Hahn wrote: >>> I am running into problems when sending large messages (about >>> 180000000 doubles) over IB. A fairly trivial example program is attached. >> >> sorry if you've already thought of this, but might you have RLIMIT_MEMLOCK >> set too low? (ulimit -l) > > Good point.
> By now I have played with all kinds of ulimits (the nodes have 16GB > of memory and 16GB of swap space - this program is not even coming close > to those limits). This is the current setting: > # ulimit -a > core file size (blocks, -c) 0 > data seg size (kbytes, -d) unlimited > scheduling priority (-e) 0 > file size (blocks, -f) unlimited > pending signals (-i) 139264 > max locked memory (kbytes, -l) unlimited > max memory size (kbytes, -m) unlimited > open files (-n) 1024 > pipe size (512 bytes, -p) 8 > POSIX message queues (bytes, -q) unlimited > real-time priority (-r) 0 > stack size (kbytes, -s) unlimited > cpu time (seconds, -t) unlimited > max user processes (-u) 139264 > virtual memory (kbytes, -v) unlimited > file locks (-x) unlimited > > ... same error :-( > >>> [[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id 199132400 opcode 549755813 vendor error 105 qp_idx 3 >> >> 105 looks like it might be an errno to me: >> #define ENOBUFS 105 /* No buffer space available */ >> >> regards, mark. > > BTW: when using Intel-MPI (MPICH2) the program segfaults with > l = 26843546 = 2^31/8 which makes me suspect that they use MPI_Byte to > transfer the data internally and multiply the variable count by 8 > without checking whether the integer overflows ... > > - Martin From siegert at sfu.ca Mon Nov 16 15:27:57 2009 From: siegert at sfu.ca (Martin Siegert) Date: Mon, 16 Nov 2009 15:27:57 -0800 Subject: [Beowulf] MPI_Isend/Irecv failure for IB and large message sizes In-Reply-To: <4B01CA67.8020303@ldeo.columbia.edu> References: <20091115004327.GA12781@stikine.its.sfu.ca> <6.2.5.6.2.20091116104423.0679eb60@NumerEx-LLC.com> <20091116205621.GB21826@stikine.its.sfu.ca> <4B01CA67.8020303@ldeo.columbia.edu> Message-ID: <20091116232757.GF21826@stikine.its.sfu.ca> Hi, On Mon, Nov 16, 2009 at 04:55:51PM -0500, Gus Correa wrote: > Hi Martin > > We didn't know which compiler you used. > So what Michael sent you ("mmodel=memory_model") > is the Intel compiler flag syntax. > (PGI uses the same syntax, IIRR.) Now that was really stupid, I am using gcc-4.3.2 and even looked up the correct syntax for the memory model, but nevertheless pasted the Intel syntax into my configure script ... sorry. > Gcc/gfortran use "-mcmodel=memory_model" for x86_64 architecture. > I only used this with Intel ifort, hence I am not sure, > but "medium" should work fine for large data/not-so-large program > in gcc/gfortran. > The "large" model doesn't seem to be implemented by gcc (4.1.2) > anyway. > (Maybe it is there in newer gcc versions.) > The darn thing is that gcc says "medium" doesn't support building > shared libraries, > hence you may need to build OpenMPI static libraries instead, > I would guess. > (Again, check this if you have a newer gcc version.) > Here's an excerpt of my gcc (4.1.2) man page: > > > -mcmodel=small > Generate code for the small code model: the program and its > symbols must be linked in the lower 2 GB of the address space. Pointers > are 64 bits. Pro- > grams can be statically or dynamically linked. This is the > default code model. > > -mcmodel=kernel > Generate code for the kernel code model. The kernel runs in the > negative 2 GB of the address space. This model has to be used for Linux > kernel code. > > -mcmodel=medium > Generate code for the medium model: The program is linked in the > lower 2 GB of the address space but symbols can be located anywhere in the > address > space. 
Programs can be statically or dynamically linked, but > building of shared libraries are not supported with the medium model. > > -mcmodel=large > Generate code for the large model: This model makes no > assumptions about addresses and sizes of sections. Currently GCC does not > implement this model. I recompiled openmpi with -mcmodel=medium and -mcmodel=large. The program still fails. The error message changes, however: id=1: calling irecv ... id=0: calling isend ... mlx4: local QP operation err (QPN 340052, WQE index 0, vendor syndrome 70, opcode = 5e) [[55365,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error polling LP CQ with status LOCAL QP OPERATION ERROR status number 2 for wr_id 282498416 opcode 11046 vendor error 112 qp_idx 3 (strerror(112) is "Host is down", which is certainly not correct). This now points to system libraries - libmlx4. Am I correct in assuming that this is either an OFED problem or OpenMPI exceeding some buffers in OFED libraries without checking? > If you are using OpenMPI, "ompi-info -config" > will tell the flags used to compile it. > Mine is 1.3.2 and has no explicit mcmodel flag, > which according to the gcc man page should default to "small". Are you - in fact, is anybody - able to run my test program? I am hoping that there is some stupid misconfiguration on the cluster that can be fixed easily, without reinstalling/recompiling all apps ... > Are you using 16GB per process or for the whole set of processes? I am running the two processes on different nodes (and nothing else on the nodes), thus each process has the full 16GB available. > > I hope this helps, > Gus Correa > --------------------------------------------------------------------- > Gustavo Correa > Lamont-Doherty Earth Observatory - Columbia University > Palisades, NY, 10964-8000 - USA > --------------------------------------------------------------------- Thanks! - Martin > Martin Siegert wrote: >> Hi Michael, >> >> On Mon, Nov 16, 2009 at 10:49:23AM -0700, Michael H. Frese wrote: >>> Martin, >>> >>> Could it be that your MPI library was compiled using a small memory >>> model? The 180 million doubles sounds suspiciously close to a 2 GB >>> addressing limit. >>> >>> This issue came up on the list recently under the topic "Fortran Array >>> size question." >>> >>> >>> Mike >> >> I am running MPI applications that use more than 16GB of memory - I do not >> believe that this is the problem. Also -mmodel=large >> does not appear to be a valid argument for gcc under x86_64: >> gcc -DNDEBUG -g -fPIC -mmodel=large conftest.c >&5 >> cc1: error: unrecognized command line option "-mmodel=large" >> >> - Martin >> >>> At 05:43 PM 11/14/2009, Martin Siegert wrote: >>>> Hi, >>>> >>>> I am running into problems when sending large messages (about >>>> 180000000 doubles) over IB. A fairly trivial example program is attached. >>>> >>>> # mpicc -g sendrecv.c >>>> # mpiexec -machinefile m2 -n 2 ./a.out >>>> id=1: calling irecv ... >>>> id=0: calling isend ... >>>> [[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 >>>> error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for >>>> wr_id 199132400 opcode 549755813 vendor error 105 qp_idx 3 >>>> >>>> This is with OpenMPI-1.3.3. >>>> Does anybody know a solution to this problem? >>>> >>>> If I use MPI_Allreduce instead of MPI_Isend/Irecv, the program just hangs >>>> and never returns. >>>> I asked on the openmpi users list but got no response ... 
>>>> >>>> Cheers, >>>> Martin >>>> >>>> -- >>>> Martin Siegert >>>> Head, Research Computing >>>> WestGrid Site Lead >>>> IT Services phone: 778 782-4691 >>>> Simon Fraser University fax: 778 782-4242 >>>> Burnaby, British Columbia email: siegert at sfu.ca >>>> Canada V5A 1S6 >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Martin Siegert Head, Research Computing WestGrid Site Lead IT Services phone: 778 782-4691 Simon Fraser University fax: 778 782-4242 Burnaby, British Columbia email: siegert at sfu.ca Canada V5A 1S6 From lindahl at pbm.com Mon Nov 16 17:20:48 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Mon, 16 Nov 2009 17:20:48 -0800 Subject: [Beowulf] MPI_Isend/Irecv failure for IB and large message sizes In-Reply-To: <6.2.5.6.2.20091116104423.0679eb60@NumerEx-LLC.com> References: <20091115004327.GA12781@stikine.its.sfu.ca> <6.2.5.6.2.20091116104423.0679eb60@NumerEx-LLC.com> Message-ID: <20091117012048.GD12561@bx9.net> On Mon, Nov 16, 2009 at 10:49:23AM -0700, Michael H. Frese wrote: > Could it be that your MPI library was compiled using a small memory > model? The 180 million doubles sounds suspiciously close to a 2 GB > addressing limit. > > This issue came up on the list recently under the topic "Fortran Array > size question." If you need a memory model other than the default small, you'll get a particular error message at link time; here's an example courtesy of the Intel software forums, but I bet that every compiler for Linux includes an example in their manual: /tmp/ifort3X7vjE.o: In function `sph': sph.f:41: relocation truncated to fit: R_X86_64_PC32 against `.bss' sph.f:94: relocation truncated to fit: R_X86_64_PC32 against `.bss' sph.f:94: relocation truncated to fit: R_X86_64_PC32 against `.bss' sph.f:94: relocation truncated to fit: R_X86_64_PC32 against `.bss' And it's only when your BSS is too big, not variables on the stack or allocated/malloced. I really doubt this is the problem either now or before. -- greg From siegert at sfu.ca Mon Nov 16 18:38:09 2009 From: siegert at sfu.ca (Martin Siegert) Date: Mon, 16 Nov 2009 18:38:09 -0800 Subject: [Beowulf] MPI_Isend/Irecv failure for IB and large message sizes In-Reply-To: <20091117012048.GD12561@bx9.net> References: <20091115004327.GA12781@stikine.its.sfu.ca> <6.2.5.6.2.20091116104423.0679eb60@NumerEx-LLC.com> <20091117012048.GD12561@bx9.net> Message-ID: <20091117023809.GA25161@stikine.its.sfu.ca> On Mon, Nov 16, 2009 at 05:20:48PM -0800, Greg Lindahl wrote: > On Mon, Nov 16, 2009 at 10:49:23AM -0700, Michael H. Frese wrote: > > > Could it be that your MPI library was compiled using a small memory > > model? The 180 million doubles sounds suspiciously close to a 2 GB > > addressing limit. > > > > This issue came up on the list recently under the topic "Fortran Array > > size question." 
> > If you need a memory model other than the default small, you'll get a > particular error message at link time; here's an example courtesy of > the Intel software forums, but I bet that every compiler for Linux > includes an example in their manual: > > /tmp/ifort3X7vjE.o: In function `sph': > sph.f:41: relocation truncated to fit: R_X86_64_PC32 against `.bss' > sph.f:94: relocation truncated to fit: R_X86_64_PC32 against `.bss' > sph.f:94: relocation truncated to fit: R_X86_64_PC32 against `.bss' > sph.f:94: relocation truncated to fit: R_X86_64_PC32 against `.bss' > > And it's only when your BSS is too big, not variables on the stack or > allocated/malloced. I really doubt this is the problem either now or > before. Thanks, that's good to know - I certainly do not see any such messages - neither with the Intel compiler nor gcc. Furthermore, compiling openmpi with mcmodel=medium or large does not make a difference. (my previous email about the error message changing was a mistake: the error message changes when l is 268435456 or larger). Also: compiling openmpi with ofed-1.4.1 does not make a difference. May I conclude that this just does not work? Or can anybody actually send an array of 180000000 doubles? - Martin From gus at ldeo.columbia.edu Mon Nov 16 19:40:51 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Mon, 16 Nov 2009 22:40:51 -0500 Subject: [Beowulf] MPI_Isend/Irecv failure for IB and large message sizes In-Reply-To: <20091116232757.GF21826@stikine.its.sfu.ca> References: <20091115004327.GA12781@stikine.its.sfu.ca> <6.2.5.6.2.20091116104423.0679eb60@NumerEx-LLC.com> <20091116205621.GB21826@stikine.its.sfu.ca> <4B01CA67.8020303@ldeo.columbia.edu> <20091116232757.GF21826@stikine.its.sfu.ca> Message-ID: <4B021B43.5000505@ldeo.columbia.edu> Hi Martin I tried your program with the four combinations of IB and TCP/IP, mcmodel small and medium. I lazily didn't recompile OpenMPI (1.3.2) with mcmodel=medium, just the program, hence this is not a very clean test. FYI, we have dual-socket quad-core AMD Opteron nodes with 16GB RAM each. OpenMPI 1.3.2, CentOS 5.2, gcc 4.1.2, OFED 1.4. When I ran on 2 nodes and 16 processes the program would always fail with segmentation fault / address not mapped on all four combinations above. However, when I ran on 2 nodes and 2 processes ( -bynode flag in use to direct each process to a separate node) then it worked over all four combinations! Here is the IB+medium stderr (you printed to stderr): id=1: calling irecv ... id=0: calling isend ... and the corresponding stdout: ... id=0: isend/irecv completed 1.954140 id=1: isend/irecv completed 4.192037 This rules out a problem with memory model, I suppose. Small is good enough for your message size, as long as there is enough RAM for all processes, MPI overhead, etc. Also, as Don Holmgren already pointed out to you, make sure your limits are properly set on the nodes. For instance, we use Torque, and we put these settings on the nodes' /etc/init.d/pbs_mom: ulimit -n 32768 ulimit -s unlimited ulimit -l unlimited Just like Don, we've been burned by this before, when using the vendor original setup. Of course these limits can be set in other ways. As a practical matter: Would it be possible/desirable to reduce the message size, splitting the huge message into several smaller ones? I know the wisdom is that one big message is better than many small ones, but here we're talking about huge, not big, and sizable, not small. 
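(As a rough illustration of such splitting, a pair of chunked wrappers, along the lines of the myMPI_Isend/myMPI_Irecv functions Martin mentions further down in the thread, could look like the sketch below. The chunk size, the function names, and the use of blocking MPI_Send/MPI_Recv instead of non-blocking calls are all assumptions, not code from this thread.)

    #include <mpi.h>

    #define CHUNK_ELEMS (16*1024*1024)  /* 16M doubles = 128 MB per message */

    /* send a large double array as a sequence of smaller messages */
    static int send_doubles_chunked(double *buf, long n, int dest,
                                    int tag, MPI_Comm comm)
    {
        long off = 0;
        while (off < n) {
            int c = (n - off > CHUNK_ELEMS) ? CHUNK_ELEMS : (int)(n - off);
            int err = MPI_Send(buf + off, c, MPI_DOUBLE, dest, tag, comm);
            if (err != MPI_SUCCESS) return err;
            off += c;
        }
        return MPI_SUCCESS;
    }

    /* matching receive; both sides must use the same total count n */
    static int recv_doubles_chunked(double *buf, long n, int src,
                                    int tag, MPI_Comm comm)
    {
        long off = 0;
        while (off < n) {
            int c = (n - off > CHUNK_ELEMS) ? CHUNK_ELEMS : (int)(n - off);
            int err = MPI_Recv(buf + off, c, MPI_DOUBLE, src, tag, comm,
                               MPI_STATUS_IGNORE);
            if (err != MPI_SUCCESS) return err;
            off += c;
        }
        return MPI_SUCCESS;
    }

(Usage would be, for example, send_doubles_chunked(buf, 180000000L, 1, 0, MPI_COMM_WORLD) on rank 0 and the matching recv on rank 1; since MPI delivers messages with the same source, destination, and tag in order, the chunks arrive in sequence.)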
Even your tiny test program takes a detectable time to run (4s+ seconds on IB, 14s+ on TCP/IP). It may be worth writing another version of it looping over smaller messages, and do some timing tests to compare with the huge message version. There may be a sweet spot for the message size vs. number of messages, I would guess. Big may not always be better. In the past a user here had a program sending very large messages (big 3D arrays). Not so big as to hit the 2GB threshold, but big enough to slow down the nodes and the cluster. Rewriting the program to loop over smaller messages (2D array slices) solved the problem. I remember other threads in the MPICH and OpenMPI mailing lists that reported difficulties with huge messages. My $0.02 Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- Martin Siegert wrote: > Hi, > > On Mon, Nov 16, 2009 at 04:55:51PM -0500, Gus Correa wrote: >> Hi Martin >> >> We didn't know which compiler you used. >> So what Michael sent you ("mmodel=memory_model") >> is the Intel compiler flag syntax. >> (PGI uses the same syntax, IIRR.) > > Now that was really stupid, I am using gcc-4.3.2 and even looked up > the correct syntax for the memory model, but nevertheless pasted the > Intel syntax into my configure script ... sorry. > >> Gcc/gfortran use "-mcmodel=memory_model" for x86_64 architecture. >> I only used this with Intel ifort, hence I am not sure, >> but "medium" should work fine for large data/not-so-large program >> in gcc/gfortran. >> The "large" model doesn't seem to be implemented by gcc (4.1.2) >> anyway. >> (Maybe it is there in newer gcc versions.) >> The darn thing is that gcc says "medium" doesn't support building >> shared libraries, >> hence you may need to build OpenMPI static libraries instead, >> I would guess. >> (Again, check this if you have a newer gcc version.) >> Here's an excerpt of my gcc (4.1.2) man page: >> >> >> -mcmodel=small >> Generate code for the small code model: the program and its >> symbols must be linked in the lower 2 GB of the address space. Pointers >> are 64 bits. Pro- >> grams can be statically or dynamically linked. This is the >> default code model. >> >> -mcmodel=kernel >> Generate code for the kernel code model. The kernel runs in the >> negative 2 GB of the address space. This model has to be used for Linux >> kernel code. >> >> -mcmodel=medium >> Generate code for the medium model: The program is linked in the >> lower 2 GB of the address space but symbols can be located anywhere in the >> address >> space. Programs can be statically or dynamically linked, but >> building of shared libraries are not supported with the medium model. >> >> -mcmodel=large >> Generate code for the large model: This model makes no >> assumptions about addresses and sizes of sections. Currently GCC does not >> implement this model. > > I recompiled openmpi with -mcmodel=medium and -mcmodel=large. The program > still fails. The error message changes, however: > > id=1: calling irecv ... > id=0: calling isend ... 
> mlx4: local QP operation err (QPN 340052, WQE index 0, vendor syndrome 70, opcode = 5e) > [[55365,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error polling LP CQ with status LOCAL QP OPERATION ERROR status number 2 for wr_id 282498416 opcode 11046 vendor error 112 qp_idx 3 > > (strerror(112) is "Host is down", which is certainly not correct). > This now points to system libraries - libmlx4. Am I correct in assuming that > this is either an OFED problem or OpenMPI exceeding some buffers in OFED > libraries without checking? > >> If you are using OpenMPI, "ompi-info -config" >> will tell the flags used to compile it. >> Mine is 1.3.2 and has no explicit mcmodel flag, >> which according to the gcc man page should default to "small". > > Are you - in fact, is anybody - able to run my test program? I am > hoping that there is some stupid misconfiguration on the cluster > that can be fixed easily, without reinstalling/recompiling all > apps ... > >> Are you using 16GB per process or for the whole set of processes? > > I am running the two processes on different nodes (and nothing else > on the nodes), thus each process has the full 16GB available. >> I hope this helps, >> Gus Correa >> --------------------------------------------------------------------- >> Gustavo Correa >> Lamont-Doherty Earth Observatory - Columbia University >> Palisades, NY, 10964-8000 - USA >> --------------------------------------------------------------------- > > Thanks! > > - Martin > >> Martin Siegert wrote: >>> Hi Michael, >>> >>> On Mon, Nov 16, 2009 at 10:49:23AM -0700, Michael H. Frese wrote: >>>> Martin, >>>> >>>> Could it be that your MPI library was compiled using a small memory >>>> model? The 180 million doubles sounds suspiciously close to a 2 GB >>>> addressing limit. >>>> >>>> This issue came up on the list recently under the topic "Fortran Array >>>> size question." >>>> >>>> >>>> Mike >>> I am running MPI applications that use more than 16GB of memory - I do not >>> believe that this is the problem. Also -mmodel=large >>> does not appear to be a valid argument for gcc under x86_64: >>> gcc -DNDEBUG -g -fPIC -mmodel=large conftest.c >&5 >>> cc1: error: unrecognized command line option "-mmodel=large" >>> >>> - Martin >>> >>>> At 05:43 PM 11/14/2009, Martin Siegert wrote: >>>>> Hi, >>>>> >>>>> I am running into problems when sending large messages (about >>>>> 180000000 doubles) over IB. A fairly trivial example program is attached. >>>>> >>>>> # mpicc -g sendrecv.c >>>>> # mpiexec -machinefile m2 -n 2 ./a.out >>>>> id=1: calling irecv ... >>>>> id=0: calling isend ... >>>>> [[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 >>>>> error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for >>>>> wr_id 199132400 opcode 549755813 vendor error 105 qp_idx 3 >>>>> >>>>> This is with OpenMPI-1.3.3. >>>>> Does anybody know a solution to this problem? >>>>> >>>>> If I use MPI_Allreduce instead of MPI_Isend/Irecv, the program just hangs >>>>> and never returns. >>>>> I asked on the openmpi users list but got no response ... 
>>>>> >>>>> Cheers, >>>>> Martin >>>>> >>>>> -- >>>>> Martin Siegert >>>>> Head, Research Computing >>>>> WestGrid Site Lead >>>>> IT Services phone: 778 782-4691 >>>>> Simon Fraser University fax: 778 782-4242 >>>>> Burnaby, British Columbia email: siegert at sfu.ca >>>>> Canada V5A 1S6 >>> _______________________________________________ >>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >>> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From siegert at sfu.ca Mon Nov 16 21:04:07 2009 From: siegert at sfu.ca (Martin Siegert) Date: Mon, 16 Nov 2009 21:04:07 -0800 Subject: [Beowulf] MPI_Isend/Irecv failure for IB and large message sizes In-Reply-To: <4B021B43.5000505@ldeo.columbia.edu> References: <20091115004327.GA12781@stikine.its.sfu.ca> <6.2.5.6.2.20091116104423.0679eb60@NumerEx-LLC.com> <20091116205621.GB21826@stikine.its.sfu.ca> <4B01CA67.8020303@ldeo.columbia.edu> <20091116232757.GF21826@stikine.its.sfu.ca> <4B021B43.5000505@ldeo.columbia.edu> Message-ID: <20091117050407.GA25626@stikine.its.sfu.ca> Hi Gus, On Mon, Nov 16, 2009 at 10:40:51PM -0500, Gus Correa wrote: > Hi Martin > > I tried your program with the four combinations of > IB and TCP/IP, mcmodel small and medium. > I lazily didn't recompile OpenMPI (1.3.2) with mcmodel=medium, > just the program, hence this is not a very clean test. > > FYI, we have dual-socket quad-core AMD Opteron > nodes with 16GB RAM each. > OpenMPI 1.3.2, CentOS 5.2, gcc 4.1.2, OFED 1.4. We have dual-socket quad-core Intel E5430, 16GB, OpenMPI-1.3.3, SL 5.3, gcc 4.3.2 (and a bunch of other compilers, but gcc-4.3.2 is used to compile OpenMPI), OFED-1.3.2 (tested OFED-1.4.1 on two test nodes). > When I ran on 2 nodes and 16 processes the program would always fail > with segmentation fault / address not mapped on all four > combinations above. > > However, when I ran on 2 nodes and 2 processes ( -bynode flag in > use to direct each process to a separate node) then it > worked over all four combinations! > > Here is the IB+medium stderr (you printed to stderr): > id=1: calling irecv ... > id=0: calling isend ... > > and the corresponding stdout: > ... > id=0: isend/irecv completed 1.954140 > id=1: isend/irecv completed 4.192037 Thanks!! Now I am surprised ... this always fails here. What's the difference? > This rules out a problem with memory model, I suppose. > Small is good enough for your message size, > as long as there is enough RAM for all processes, > MPI overhead, etc. > > Also, as Don Holmgren already pointed out to you, > make sure your limits are properly set on the nodes. > For instance, we use Torque, and we put these settings > on the nodes' /etc/init.d/pbs_mom: > > ulimit -n 32768 > ulimit -s unlimited > ulimit -l unlimited > > Just like Don, we've been burned by this before, when using the > vendor original setup. > Of course these limits can be set in other ways. I have been running this on the two test nodes without going through torque to avoid exactly these kind of problems. Anyway, I just ran the same program through torque, ran "ulimit -a" in the pbs script (all looks fine), but the program still fails. > As a practical matter: > > Would it be possible/desirable to reduce the message size, > splitting the huge message into several smaller ones? > I know the wisdom is that one big message is better > than many small ones, but here we're talking about huge, > not big, and sizable, not small. 
> > Even your tiny test program takes a detectable time to run > (4s+ seconds on IB, 14s+ on TCP/IP). > It may be worth writing another version of it looping over > smaller messages, > and do some timing tests to compare with the huge > message version. > There may be a sweet spot for the message size vs. number of > messages, I would guess. > Big may not always be better. > > In the past a user here had a program sending very large messages > (big 3D arrays). > Not so big as to hit the 2GB threshold, but big enough to > slow down the nodes and the cluster. > Rewriting the program to loop over smaller messages > (2D array slices) solved the problem. > I remember other threads in the MPICH and OpenMPI > mailing lists that reported difficulties with huge messages. > > My $0.02 > Gus Correa > --------------------------------------------------------------------- > Gustavo Correa > Lamont-Doherty Earth Observatory - Columbia University > Palisades, NY, 10964-8000 - USA > --------------------------------------------------------------------- In principle, yes ... I already wrote wrapper functions myMPI_Isend, myMPI_Irecv that do exactly that. However, we are talking about one of those quantum chemistry programs: many thousands of lines ... I'd really like to avoid this. - Martin > Martin Siegert wrote: >> Hi, >> >> On Mon, Nov 16, 2009 at 04:55:51PM -0500, Gus Correa wrote: >>> Hi Martin >>> >>> We didn't know which compiler you used. >>> So what Michael sent you ("mmodel=memory_model") >>> is the Intel compiler flag syntax. >>> (PGI uses the same syntax, IIRR.) >> >> Now that was really stupid, I am using gcc-4.3.2 and even looked up >> the correct syntax for the memory model, but nevertheless pasted the >> Intel syntax into my configure script ... sorry. >> >>> Gcc/gfortran use "-mcmodel=memory_model" for x86_64 architecture. >>> I only used this with Intel ifort, hence I am not sure, >>> but "medium" should work fine for large data/not-so-large program >>> in gcc/gfortran. >>> The "large" model doesn't seem to be implemented by gcc (4.1.2) >>> anyway. >>> (Maybe it is there in newer gcc versions.) >>> The darn thing is that gcc says "medium" doesn't support building >>> shared libraries, >>> hence you may need to build OpenMPI static libraries instead, >>> I would guess. >>> (Again, check this if you have a newer gcc version.) >>> Here's an excerpt of my gcc (4.1.2) man page: >>> >>> >>> -mcmodel=small >>> Generate code for the small code model: the program and its >>> symbols must be linked in the lower 2 GB of the address space. Pointers >>> are 64 bits. Pro- >>> grams can be statically or dynamically linked. This is the >>> default code model. >>> >>> -mcmodel=kernel >>> Generate code for the kernel code model. The kernel runs in >>> the negative 2 GB of the address space. This model has to be used for >>> Linux kernel code. >>> >>> -mcmodel=medium >>> Generate code for the medium model: The program is linked in >>> the lower 2 GB of the address space but symbols can be located anywhere >>> in the address >>> space. Programs can be statically or dynamically linked, but >>> building of shared libraries are not supported with the medium model. >>> >>> -mcmodel=large >>> Generate code for the large model: This model makes no >>> assumptions about addresses and sizes of sections. Currently GCC does >>> not implement this model. >> >> I recompiled openmpi with -mcmodel=medium and -mcmodel=large. The program >> still fails. The error message changes, however: >> >> id=1: calling irecv ... 
>> id=0: calling isend ... >> mlx4: local QP operation err (QPN 340052, WQE index 0, vendor syndrome 70, opcode = 5e) >> [[55365,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error polling LP CQ with status LOCAL QP OPERATION ERROR status number 2 for wr_id 282498416 opcode 11046 vendor error 112 qp_idx 3 >> >> (strerror(112) is "Host is down", which is certainly not correct). >> This now points to system libraries - libmlx4. Am I correct in assuming that >> this is either an OFED problem or OpenMPI exceeding some buffers in OFED >> libraries without checking? >> >>> If you are using OpenMPI, "ompi-info -config" >>> will tell the flags used to compile it. >>> Mine is 1.3.2 and has no explicit mcmodel flag, >>> which according to the gcc man page should default to "small". >> >> Are you - in fact, is anybody - able to run my test program? I am >> hoping that there is some stupid misconfiguration on the cluster >> that can be fixed easily, without reinstalling/recompiling all >> apps ... >> >>> Are you using 16GB per process or for the whole set of processes? >> >> I am running the two processes on different nodes (and nothing else >> on the nodes), thus each process has the full 16GB available. >>> I hope this helps, >>> Gus Correa >>> --------------------------------------------------------------------- >>> Gustavo Correa >>> Lamont-Doherty Earth Observatory - Columbia University >>> Palisades, NY, 10964-8000 - USA >>> --------------------------------------------------------------------- >> >> Thanks! >> >> - Martin >> >>> Martin Siegert wrote: >>>> Hi Michael, >>>> >>>> On Mon, Nov 16, 2009 at 10:49:23AM -0700, Michael H. Frese wrote: >>>>> Martin, >>>>> >>>>> Could it be that your MPI library was compiled using a small memory >>>>> model? The 180 million doubles sounds suspiciously close to a 2 GB >>>>> addressing limit. >>>>> >>>>> This issue came up on the list recently under the topic "Fortran Array >>>>> size question." >>>>> >>>>> >>>>> Mike >>>> I am running MPI applications that use more than 16GB of memory - I do >>>> not believe that this is the problem. Also -mmodel=large >>>> does not appear to be a valid argument for gcc under x86_64: >>>> gcc -DNDEBUG -g -fPIC -mmodel=large conftest.c >&5 >>>> cc1: error: unrecognized command line option "-mmodel=large" >>>> >>>> - Martin >>>> >>>>> At 05:43 PM 11/14/2009, Martin Siegert wrote: >>>>>> Hi, >>>>>> >>>>>> I am running into problems when sending large messages (about >>>>>> 180000000 doubles) over IB. A fairly trivial example program is attached. >>>>>> >>>>>> # mpicc -g sendrecv.c >>>>>> # mpiexec -machinefile m2 -n 2 ./a.out >>>>>> id=1: calling irecv ... >>>>>> id=0: calling isend ... >>>>>> [[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 >>>>>> error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for >>>>>> wr_id 199132400 opcode 549755813 vendor error 105 qp_idx 3 >>>>>> >>>>>> This is with OpenMPI-1.3.3. >>>>>> Does anybody know a solution to this problem? >>>>>> >>>>>> If I use MPI_Allreduce instead of MPI_Isend/Irecv, the program just hangs >>>>>> and never returns. >>>>>> I asked on the openmpi users list but got no response ... 
>>>>>> >>>>>> Cheers, >>>>>> Martin >>>>>> >>>>>> -- >>>>>> Martin Siegert >>>>>> Head, Research Computing >>>>>> WestGrid Site Lead >>>>>> IT Services phone: 778 782-4691 >>>>>> Simon Fraser University fax: 778 782-4242 >>>>>> Burnaby, British Columbia email: siegert at sfu.ca >>>>>> Canada V5A 1S6 >>>> _______________________________________________ >>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >>>> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf >> -- Martin Siegert Head, Research Computing WestGrid Site Lead IT Services phone: 778 782-4691 Simon Fraser University fax: 778 782-4242 Burnaby, British Columbia email: siegert at sfu.ca Canada V5A 1S6 From gus at ldeo.columbia.edu Mon Nov 16 22:26:52 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Tue, 17 Nov 2009 01:26:52 -0500 Subject: [Beowulf] MPI_Isend/Irecv failure for IB and large message sizes In-Reply-To: <20091117050407.GA25626@stikine.its.sfu.ca> References: <20091115004327.GA12781@stikine.its.sfu.ca> <6.2.5.6.2.20091116104423.0679eb60@NumerEx-LLC.com> <20091116205621.GB21826@stikine.its.sfu.ca> <4B01CA67.8020303@ldeo.columbia.edu> <20091116232757.GF21826@stikine.its.sfu.ca> <4B021B43.5000505@ldeo.columbia.edu> <20091117050407.GA25626@stikine.its.sfu.ca> Message-ID: <4B02422C.9080308@ldeo.columbia.edu> Hi Martin Answers/comments inline below Martin Siegert wrote: > Hi Gus, > > On Mon, Nov 16, 2009 at 10:40:51PM -0500, Gus Correa wrote: >> Hi Martin >> >> I tried your program with the four combinations of >> IB and TCP/IP, mcmodel small and medium. >> I lazily didn't recompile OpenMPI (1.3.2) with mcmodel=medium, >> just the program, hence this is not a very clean test. >> >> FYI, we have dual-socket quad-core AMD Opteron >> nodes with 16GB RAM each. >> OpenMPI 1.3.2, CentOS 5.2, gcc 4.1.2, OFED 1.4. > > We have dual-socket quad-core Intel E5430, 16GB, > OpenMPI-1.3.3, SL 5.3, gcc 4.3.2 (and a bunch of other compilers, > but gcc-4.3.2 is used to compile OpenMPI), OFED-1.3.2 (tested > OFED-1.4.1 on two test nodes). > >> When I ran on 2 nodes and 16 processes the program would always fail >> with segmentation fault / address not mapped on all four >> combinations above. >> >> However, when I ran on 2 nodes and 2 processes ( -bynode flag in >> use to direct each process to a separate node) then it >> worked over all four combinations! >> >> Here is the IB+medium stderr (you printed to stderr): >> id=1: calling irecv ... >> id=0: calling isend ... >> >> and the corresponding stdout: >> ... >> id=0: isend/irecv completed 1.954140 >> id=1: isend/irecv completed 4.192037 > > Thanks!! > Now I am surprised ... this always fails here. > What's the difference? > The software stack is not the same, neither the hardware. But I would guess they are not so far apart to make the difference. Have you tried to run on TCP/IP? Say, using: -mca btl tcp,sm,self \ and perhaps -mca btl_tcp_if_exclude lo,eth[0,1] or -mca btl_tcp_if_include eth[0,1] to select the Ethernet port? I would guess you have at least one Ethernet network to test the program over TCP/IP. If it works on TCP/IP, then the problem is likely to reside within IB. (Maybe in OFED-1.3.2?) >> This rules out a problem with memory model, I suppose. >> Small is good enough for your message size, >> as long as there is enough RAM for all processes, >> MPI overhead, etc. >> >> Also, as Don Holmgren already pointed out to you, >> make sure your limits are properly set on the nodes. 
>> For instance, we use Torque, and we put these settings >> on the nodes' /etc/init.d/pbs_mom: >> >> ulimit -n 32768 >> ulimit -s unlimited >> ulimit -l unlimited >> >> Just like Don, we've been burned by this before, when using the >> vendor original setup. >> Of course these limits can be set in other ways. > > I have been running this on the two test nodes without going through > torque to avoid exactly these kind of problems. > Anyway, I just ran the same program through torque, ran "ulimit -a" > in the pbs script (all looks fine), but the program still fails. > >> As a practical matter: >> >> Would it be possible/desirable to reduce the message size, >> splitting the huge message into several smaller ones? >> I know the wisdom is that one big message is better >> than many small ones, but here we're talking about huge, >> not big, and sizable, not small. >> >> Even your tiny test program takes a detectable time to run >> (4s+ seconds on IB, 14s+ on TCP/IP). >> It may be worth writing another version of it looping over >> smaller messages, >> and do some timing tests to compare with the huge >> message version. >> There may be a sweet spot for the message size vs. number of >> messages, I would guess. >> Big may not always be better. >> >> In the past a user here had a program sending very large messages >> (big 3D arrays). >> Not so big as to hit the 2GB threshold, but big enough to >> slow down the nodes and the cluster. >> Rewriting the program to loop over smaller messages >> (2D array slices) solved the problem. >> I remember other threads in the MPICH and OpenMPI >> mailing lists that reported difficulties with huge messages. >> >> My $0.02 >> Gus Correa >> --------------------------------------------------------------------- >> Gustavo Correa >> Lamont-Doherty Earth Observatory - Columbia University >> Palisades, NY, 10964-8000 - USA >> --------------------------------------------------------------------- > > In principle, yes ... I already wrote wrapper functions > myMPI_Isend, myMPI_Irecv that do exactly that. > However, we are talking about one of those quantum chemistry > programs: many thousands of lines ... I'd really like to avoid > this. > > - Martin > A few days ago somebody posted here a tip on how to run VASP in a more scalable/efficient way by just choosing some internal code parameters (probably available through a mere namelist). This was after a long discussion here on how to make VASP more scalable by tweaking with OpenMPI MCA parameters, etc, etc. Would your user be willing to take a look at the code documentation and find out if there is a way to decompose his domain, or matrix, or problem, or whatever, in a more sensible (and hopefully scalable) way? Often times there is. These programs are not necessarily poorly designed, but users need read the documentation (or articles about the method) to find out how to use them right. A knowledgeable user should understand what the mathematical method and the algorithm are doing, or at least be willing to learn the basics of them. Unless the problem itself is huge, passing an array of 180 million doubles doesn't sound reasonable,just a brute force approach, particularly if only two processes are sharing the work, if you don't mind my saying that. And if the problem is huge, one could argue that more nodes/processes and smaller messages could be used to get the job done better. We're mostly a climate, atmosphere, ocean shop, but this doesn't mean that we are protected from this type of problem either. Just a suggestion. 
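(For concreteness, the TCP/IP fallback test suggested earlier in this message could be run along these lines with OpenMPI 1.3 syntax; the Ethernet interface name is an assumption:)

    mpirun -np 2 -bynode -machinefile m2 \
           -mca btl tcp,sm,self \
           -mca btl_tcp_if_include eth0 \
           ./a.out

If the same two-process run that fails over the openib BTL completes over tcp, that points at the IB stack rather than at the application or the memory model.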
Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- >> Martin Siegert wrote: >>> Hi, >>> >>> On Mon, Nov 16, 2009 at 04:55:51PM -0500, Gus Correa wrote: >>>> Hi Martin >>>> >>>> We didn't know which compiler you used. >>>> So what Michael sent you ("mmodel=memory_model") >>>> is the Intel compiler flag syntax. >>>> (PGI uses the same syntax, IIRR.) >>> Now that was really stupid, I am using gcc-4.3.2 and even looked up >>> the correct syntax for the memory model, but nevertheless pasted the >>> Intel syntax into my configure script ... sorry. >>> >>>> Gcc/gfortran use "-mcmodel=memory_model" for x86_64 architecture. >>>> I only used this with Intel ifort, hence I am not sure, >>>> but "medium" should work fine for large data/not-so-large program >>>> in gcc/gfortran. >>>> The "large" model doesn't seem to be implemented by gcc (4.1.2) >>>> anyway. >>>> (Maybe it is there in newer gcc versions.) >>>> The darn thing is that gcc says "medium" doesn't support building >>>> shared libraries, >>>> hence you may need to build OpenMPI static libraries instead, >>>> I would guess. >>>> (Again, check this if you have a newer gcc version.) >>>> Here's an excerpt of my gcc (4.1.2) man page: >>>> >>>> >>>> -mcmodel=small >>>> Generate code for the small code model: the program and its >>>> symbols must be linked in the lower 2 GB of the address space. Pointers >>>> are 64 bits. Pro- >>>> grams can be statically or dynamically linked. This is the >>>> default code model. >>>> >>>> -mcmodel=kernel >>>> Generate code for the kernel code model. The kernel runs in >>>> the negative 2 GB of the address space. This model has to be used for >>>> Linux kernel code. >>>> >>>> -mcmodel=medium >>>> Generate code for the medium model: The program is linked in >>>> the lower 2 GB of the address space but symbols can be located anywhere >>>> in the address >>>> space. Programs can be statically or dynamically linked, but >>>> building of shared libraries are not supported with the medium model. >>>> >>>> -mcmodel=large >>>> Generate code for the large model: This model makes no >>>> assumptions about addresses and sizes of sections. Currently GCC does >>>> not implement this model. >>> I recompiled openmpi with -mcmodel=medium and -mcmodel=large. The program >>> still fails. The error message changes, however: >>> >>> id=1: calling irecv ... >>> id=0: calling isend ... >>> mlx4: local QP operation err (QPN 340052, WQE index 0, vendor syndrome 70, opcode = 5e) >>> [[55365,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error polling LP CQ with status LOCAL QP OPERATION ERROR status number 2 for wr_id 282498416 opcode 11046 vendor error 112 qp_idx 3 >>> >>> (strerror(112) is "Host is down", which is certainly not correct). >>> This now points to system libraries - libmlx4. Am I correct in assuming that >>> this is either an OFED problem or OpenMPI exceeding some buffers in OFED >>> libraries without checking? >>> >>>> If you are using OpenMPI, "ompi-info -config" >>>> will tell the flags used to compile it. >>>> Mine is 1.3.2 and has no explicit mcmodel flag, >>>> which according to the gcc man page should default to "small". >>> Are you - in fact, is anybody - able to run my test program? 
I am >>> hoping that there is some stupid misconfiguration on the cluster >>> that can be fixed easily, without reinstalling/recompiling all >>> apps ... >>> >>>> Are you using 16GB per process or for the whole set of processes? >>> I am running the two processes on different nodes (and nothing else >>> on the nodes), thus each process has the full 16GB available. >>>> I hope this helps, >>>> Gus Correa >>>> --------------------------------------------------------------------- >>>> Gustavo Correa >>>> Lamont-Doherty Earth Observatory - Columbia University >>>> Palisades, NY, 10964-8000 - USA >>>> --------------------------------------------------------------------- >>> Thanks! >>> >>> - Martin >>> >>>> Martin Siegert wrote: >>>>> Hi Michael, >>>>> >>>>> On Mon, Nov 16, 2009 at 10:49:23AM -0700, Michael H. Frese wrote: >>>>>> Martin, >>>>>> >>>>>> Could it be that your MPI library was compiled using a small memory >>>>>> model? The 180 million doubles sounds suspiciously close to a 2 GB >>>>>> addressing limit. >>>>>> >>>>>> This issue came up on the list recently under the topic "Fortran Array >>>>>> size question." >>>>>> >>>>>> >>>>>> Mike >>>>> I am running MPI applications that use more than 16GB of memory - I do >>>>> not believe that this is the problem. Also -mmodel=large >>>>> does not appear to be a valid argument for gcc under x86_64: >>>>> gcc -DNDEBUG -g -fPIC -mmodel=large conftest.c >&5 >>>>> cc1: error: unrecognized command line option "-mmodel=large" >>>>> >>>>> - Martin >>>>> >>>>>> At 05:43 PM 11/14/2009, Martin Siegert wrote: >>>>>>> Hi, >>>>>>> >>>>>>> I am running into problems when sending large messages (about >>>>>>> 180000000 doubles) over IB. A fairly trivial example program is attached. >>>>>>> >>>>>>> # mpicc -g sendrecv.c >>>>>>> # mpiexec -machinefile m2 -n 2 ./a.out >>>>>>> id=1: calling irecv ... >>>>>>> id=0: calling isend ... >>>>>>> [[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 >>>>>>> error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for >>>>>>> wr_id 199132400 opcode 549755813 vendor error 105 qp_idx 3 >>>>>>> >>>>>>> This is with OpenMPI-1.3.3. >>>>>>> Does anybody know a solution to this problem? >>>>>>> >>>>>>> If I use MPI_Allreduce instead of MPI_Isend/Irecv, the program just hangs >>>>>>> and never returns. >>>>>>> I asked on the openmpi users list but got no response ... >>>>>>> >>>>>>> Cheers, >>>>>>> Martin >>>>>>> >>>>>>> -- >>>>>>> Martin Siegert >>>>>>> Head, Research Computing >>>>>>> WestGrid Site Lead >>>>>>> IT Services phone: 778 782-4691 >>>>>>> Simon Fraser University fax: 778 782-4242 >>>>>>> Burnaby, British Columbia email: siegert at sfu.ca >>>>>>> Canada V5A 1S6 >>>>> _______________________________________________ >>>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >>>>> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From jlforrest at berkeley.edu Tue Nov 17 10:11:27 2009 From: jlforrest at berkeley.edu (Jon Forrest) Date: Tue, 17 Nov 2009 10:11:27 -0800 Subject: [Beowulf] How Would You Test Infiniband in New Cluster? Message-ID: <4B02E74F.8050103@berkeley.edu> Let's say you have a brand new cluster with brand new Infiniband hardware, and that you've installed OFED 1.4 and the appropriate drivers for your IB HCAs (i.e. you see ib0 devices on the frontend and all compute nodes). The cluster appears to be working fine but you're not sure about IB. 
How would you test your IB network to make sure all is well? Cordially, -- Jon Forrest Research Computing Support College of Chemistry 173 Tan Hall University of California Berkeley Berkeley, CA 94720-1460 510-643-1032 jlforrest at berkeley.edu From bill at cse.ucdavis.edu Tue Nov 17 10:33:17 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Tue, 17 Nov 2009 10:33:17 -0800 Subject: [Beowulf] How Would You Test Infiniband in New Cluster? In-Reply-To: <4B02E74F.8050103@berkeley.edu> References: <4B02E74F.8050103@berkeley.edu> Message-ID: <4B02EC6D.60107@cse.ucdavis.edu> Jon Forrest wrote: > Let's say you have a brand new cluster with > brand new Infiniband hardware, and that > you've installed OFED 1.4 and the > appropriate drivers for your IB > HCAs (i.e. you see ib0 devices > on the frontend and all compute nodes). > The cluster appears to be working > fine but you're not sure about IB. > > How would you test your IB network > to make sure all is well? My first suggest sanity test would be to test latency and bandwidth to insure you are getting IB numbers. So 80-100MB/sec and 30-60us for a small packet would imply GigE. 6-8 times the bandwidth certainly would imply SDR or better. Latency varies quite a bit among implementation, I'd try to get within 30-40% of advertised latency numbers. Then I'd try a workload that kept all nodes busy with something communications intensive. Pathscale has a mpi_nxnlatbw which works reasonable well to identify ports/nodes that are are slower than expected. After that works I'd suggest a production MPI work load with a known answer. From jlforrest at berkeley.edu Tue Nov 17 10:58:43 2009 From: jlforrest at berkeley.edu (Jon Forrest) Date: Tue, 17 Nov 2009 10:58:43 -0800 Subject: [Beowulf] How Would You Test Infiniband in New Cluster? In-Reply-To: <4B02EC6D.60107@cse.ucdavis.edu> References: <4B02E74F.8050103@berkeley.edu> <4B02EC6D.60107@cse.ucdavis.edu> Message-ID: <4B02F263.9030607@berkeley.edu> Bill Broadley wrote: > My first suggest sanity test would be to test latency and bandwidth to insure > you are getting IB numbers. So 80-100MB/sec and 30-60us for a small packet > would imply GigE. 6-8 times the bandwidth certainly would imply SDR or > better. Latency varies quite a bit among implementation, I'd try to get > within 30-40% of advertised latency numbers. For those of us who aren't familiar with IB utilities, could you give some examples of the commands you'd use to do this? Thanks, Jon From agshew at gmail.com Tue Nov 17 11:21:05 2009 From: agshew at gmail.com (Andrew Shewmaker) Date: Tue, 17 Nov 2009 12:21:05 -0700 Subject: [Beowulf] UEFI (Unified Extensible Firmware Interface) for the BIOS In-Reply-To: References: Message-ID: On Thu, Nov 12, 2009 at 5:47 PM, Rahul Nabar wrote: > Has anyone tried out UEFI (Unified Extensible Firmware Interface) in > the BIOS? The new servers I am buying come with this option in the > BIOS. Out of curiosity I googled it up. > > I am not sure if there were any HPC implications of this and wanted to > double check before I switched to this from my conventional > plain-vanilla BIOS. Any sort of "industry standard" always sounds good > but I thought it safer to check on the group first.... > > Any advice or pitfalls? Here's something on EFI I wrote up for myself in 2005. It's a bit out of date, but it covers some stuff that wikipedia doesn't. In particular, I would read the old Kernel Traffic to understand how various developers dislike EFI. 
And in case you are wondering, this post looks a bit different because it is in moinmoin wiki markup. == Firmware Awareness == You may have heard of Intel's EFI, but have wondered how does it compare to legacy BIOSes, Open Firmware, and LinuxBIOS. Here's some stuff you might want to be aware of. === Acronyms === ACPI - Advanced Configuration and Power Interface EBC - EFI Byte Code EFI - Extensible Firmware Interface UEFI Forum - The Unified EFI Forum is a group of companies (all of the big PC players) responsible for the devolopment and promotion of EFI LinuxBIOS - small, fast open source alternative to proprietary PC BIOSes OpenBIOS - open source Open Firmware implementation Open Firmware - defined by IEEE-1275 and used by Sun Microsystems (since 1988), IBM, and Apple to initialize hardware and boot Operating Systems in a largely hardware-independent manner === UEFI Forum === The UEFI Forum just announced its existence, and it looks like Intel has convinced the major PC vendors and their rival AMD to adopt EFI as a replacement for legacy BIOSes. It looks like we will be seeing EFI everywhere. === Overview of EFI === [UEFI] Q: Does UEFI completely replace a PC BIOS? A: No. While UEFI uses a different interface for boot services and runtime services, some platform firmware must perform the functions BIOS uses for system configuration (a.k.a. Power On Self Test or POST) and Setup. UEFI does not specify how POST & Setup are implemented. Q: How is UEFI implemented on a computer system? A: UEFI is an interface. It can be implemented on top of a traditional BIOS (in which case it supplants the traditional INT entry points into BIOS) or on top of non-BIOS implementations. [Singh] In a representative EFI system, a thin Pre-EFI Initialization Layer (PEI) might do most of the POST-related work that is traditionally done by the BIOS POST. This includes things like chipset initialization, memory initialization, bus enumeration, etc. EFI prepares a Driver Execution Environment (DXE) to provide generic platform functions that EFI drivers may use. The drivers themselves provide specific platform capabilities and customizations. ... Andrew Fish invented EFI at his desk in the late 1990s, calling it Intel Boot Initiative (IBI) at that time. He offered his 26 page unsolicited white paper to his management. The paper was meant to be a response to major operating system and hardware companies rejecting legacy BIOS as the firmware for enterprise class Itanium? Processor Platforms. Andrew says: "At that time, two firmware solutions were put on the table as replacements for BIOS architectures for the Itanium: Alpha Reference Console (ARC) and Open Firmware. It turned out that nobody really owned the inellectual property to ARC, and in any case, it did not have enough extensible properties to make it practical for a horizontal industry. At this point, Open Firmware became the frontrunner as Apple and Sun both used it. However, Open Firmware was not without its own technical challenges. The PC had started down a path of using the Advanced Configuration and Power Interface (ACPI) as its runtime namespace to describe the platform to the operating system. As I liked to say at the time, the only thing worse than one namespace is keeping two namespaces in sync. The other problem was the lack of third party support for Open Firmware. We invited the FirmWorks guys to come visit us at Dupont (WA), and we had a great talk. 
Given we had just gone through an exercise of inventing a firmware base from scratch, I think we were uniquely qualified to appreciate what Open Firmware had been able to achieve. Unfortunately, it became clear that the infrastructure to support a transition to Open Firmware did not exist. Given the namespace issue with Open Firmware and the lack of industry enabling infrastructure, we decided to go on and make EFI a reality." "EFI is an interface specification and it really is more about how to write an operating system loader and an Option ROM than it is about how to make a BIOS that works. The Intel? Platform Innovation Framework for EFI (Framework for short) is Intel's next generation firmware architecture from the ground up. The core chunks of this code are available under an Open Source license at www.TianoCore.org. Tiano was the developer code name while Framework was the marketing name." "To sum up, EFI is an industry interface specification that defines how OS loaders and PCI Option ROMs work. The Framework defines a new modular architecture that allows an entire firmware base to be constructed in a modular fashion. The Framework has a nice property in that it allows binary modules to work together in the boot process. This allows the code from each vendor to have an arbitrary license type. Intel? was interested in EFI from making a standard Itanium platform (as well as IA-32 platforms of the future) to drive adoption and enable a horizontal industry to make compatible platforms. The Framework is more about silicon enabling, so it drills down to a much lower level of how things work." Fish says that PCs had already started down the path of ACPI and that the Open Firmware namespace was incompatible. Evidently when ACPI was developed, the already existent Open Firmware specification/namespace was ignored. [Intel] 1992 APM 1.0 1993 APM 1.1, APM Energy Star 1994 1995 PCI Mobile Design Guide 1.1, PCI 2.1 1996 APM 1.2 1997 ACPI 1.0, PCI PM 1.0 1998 PCI 2.2 1999 ACPI 1.0b 2000 ACPI 2.0 Figure 2.1 PC Power Management Specification Timeline === Linus Torvalds comments on EFI === Linus Torvalds. [Brown] EFI is doing all the wrong things. Trying to fix BIOSes by being "more generic". It's going to be a total nightmare if you go down that path. What will work is: * standard hardware interfaces. Instead of working on bytecode interpreters, make the f*cking hardware definition instead, and make it SANE and PUBLIC! So that we can write drivers that work, and that come with source so that we can fix them when somebody has buggy hardware. DO NOT MAKE ANOTHER FRIGGING BYTECODE INTERPRETER! Didn't Intel learn anything from past mistakes? ACPI was supposed to be "simple". Codswallop. PCI works, because it had standard, and documented, hardware interfaces. The interfaces aren't well specified enough to write a PCI disk driver, of course, but they _are_ good enough to do discovery and a lot of things. Intel _could_ make a "PCI disk controller interface definition", and it will work. The way USB does actually work, and UHCI was actually a fair standard, even if it left _way_ too much to software. * Source code. LinuxBIOS works today, and is a lot more flexible than EFI will _ever_ be. * Compatibility. Make hardware that works with old drivers and old BIOSes. This works. The fact that Intel forgot about that with ia-64 is not an excuse to make _more_ mistakes. === Intel's Reply to Linus === Mark Doran of Intel. 
[Brown] The trouble with the "architectural hardware" argument proved to be that PCI is already well established and there is a vibrant industry churning out innovative PCI cards on a regular basis. The idea of a single interface definition for all cards of each of the network, storage or video classes is viewed as simply too limiting and the argument was made to us that to force such a model would be to stifle innovation in peripherals. So effectively the feedback we got on "architectural hardware" was therefore along the lines of "good idea but not practical..." ... As a practical matter carrying multiple instruction set versions of the same code gets expensive in FLASH memory terms. Consider an EFI compiled driver for IA-32 as the index, size: one unit. With code size expansion, an Itanium compiled driver is going to be three to four times that size. Total ROM container requirement: one unit for the legacy ROM image plus one for an EFI IA-32 driver plus three to four units for an Itanium compiled driver image; to make the card "just work" when you plug it into a variety of systems is starting to require a lot of FLASH on the card. More than the IHVs were willing to countenance in most cases for cost reasons. EFI Byte Code was born of this challenge. Its goals are pretty straightforward: architecture neutral image, small foot print in the add-in card ROM container and of course small footprint in the motherboard which will have to carry an interpreter. We also insisted that the C source for a driver should be the same regardless of whether you build it for a native machine instruction set or EBC. ... You may ask why we didn't just use an existing definition as opposed to making a new one. We did actually spend quite a bit of time on that very question. Most alternatives would have significantly swelled the ROM container size requirement or the motherboard support overhead requirement or had licensing, IP or other impediments to deployment into the wider industry that we had no practical means to resolve. With specific reference to why we chose not to use the IA-32 instruction set for this purpose, it was all about the size of an interpreter for that instruction set. To provide compatibility 100% for the universe of real mode option ROM binaries out there would require a comprehensive treatment of a very rich instruction set architecture. We could see no practical way to persuade OEMs building systems using processors other than IA-32 to carry along that much interpreter code in their motherboard ROM. ... ... EBC requires a small interpreter with no libraries (roughly 18k uncompressed total on IA-32 for example) and the average add-in card ROM image size is 1.5 units relative to native IA-32 code. And keep in mind that using byte code for this purpose is in widespread, long time use on other CPU architectures so we felt the technique in general was viable based on industry experience with it. Yes, it's a compromise but the best balance point we have been able find to date. ... There is nothing about the definition of the EFI spec or the driver model associated that prevents vendors from making add-in card drivers and presenting them in Open Source form to the community. In fact we've specifically included the ability to "late bind" a driver into a system that speaks EFI. In practice that late binding means that code that uses EFI services and that is GPL code can be used on systems that also include EFI code that is not open source. 
The decision on whether to make any given driver Open Source or not therefore lies with the creator of that code. In the case of ROM content for an add-in card that will usually be the IHV that makes the card. ... The patent license grant is thus in some sense a double coverage approach...you don't really need a patent license grant since there aren't any patents that read but to reinforce that you don't need to worry about patents we give you the grant anyway. This helped make some corporate entities more comfortable about implementing support for EFI. === Comparing EBC to x86 bytecode === However, the EBC interpreter isn't that much smaller than an x86 emulator. Add in the fact that x86 bytecode generation is much more common and proven, and it looks like EBC isn't as big a win as Intel believes. [Lo] In this paper we present our preliminary results on FreeVGA, an x86 emulator based on x86emu that can be used as such a compatibility layer. We will show how we have successfully used FreeVGA to initialize VGA cards from both ATI and Nvidia on a Tyan S2885 platform. ... Integrating FreeVGA into LinuxBIOS had virtually no impact on the size of the resulting ROM image. The compressed ROM image only increased by 16KB, but because the final ROM image is padded to the nearest power of 2, this increase was absorbed into the existing unused space. The runtime size of the uncompressed image was only increased by 40KB. === LinuxBIOS === Since EFI doesn't deal with POST or setup, it could actually sit on top of something like LinuxBIOS. In fact, the OpenBIOS project is already planning on putting their Open Firmware on top of LinuxBIOS. Certainly, if LinuxBIOS can use an x86 emulator, it can use a Forth or EBC interpreter. Interestingly, EFI might be a good way to have more vendors use LinuxBIOS in their products. How much can vendors differentiate themselves in POST and Setup? Probably not much. They will want to differentiate themselves in what they put on top of the abstraction layer. If that is the case, then it would be in everyone's best interest to adopt LinuxBIOS for the boring, unprofitable POST and Setup for cost sharing reasons. The romcc compiler developed for the LinuxBIOS project is another good reason to use it because it reduces the amount of assembly code needed to initialize a machine. [Minnich] In 2002, Eric Biederman of Linux NetworX developed a compiler called romcc. romcc is a simple optimizing C compiler-one file, 25,043 lines of code-that uses only registers, not memory. The compiler can use extended register sets such as MMX, SSI or 3DNOW. romcc allowed us to junk almost all of the assembly code in LinuxBIOS, so that even the earliest code, run with no working DRAM, can be written in C. romcc is used only for early, pre-memory code. For code that runs after memory comes up, we use GCC. === References === Brown, Zack Kernel Traffic #231 For 10 Sep 2003. http://web.archive.org/web/20030926022111/http://www.kerneltraffic.org/kernel-traffic/kt20030910_231.html#7 Intel. Power Management History and Motivation. http://www.intel.com/intelpress/samples/ppm_chapter.pdf Lo, Li-Ta; Watson, Gregory R.; Minnich, Ronald G. FreeVGA: Architecture Independent Video Graphics Initialization for LinuxBIOS. http://www.linuxbios.org/data/vgabios/ Minnich, Ronald G. Porting LinuxBIOS to the AMD SC520. http://www.linuxjournal.com/article/8120 Singh, Amit. More Power to Firmware. http://kernelthread.com/publications/firmware/ UEFI. About UEFI. 
http://www.uefi.org/about.asp -- Andrew Shewmaker From w.a.sellers at nasa.gov Mon Nov 16 05:44:50 2009 From: w.a.sellers at nasa.gov (Sellers, William A. (LARC-D205)[NCI]) Date: Mon, 16 Nov 2009 07:44:50 -0600 Subject: [Beowulf] mpd ..failed ..! In-Reply-To: References: Message-ID: You need to create a .mpd.conf file in your home directory - ~/.mpd.conf and it must contain a line: MPD_SECRETWORD=change-me-to-something-else The file must be mode 600 or it will not work. Make sure this file is shared among all the nodes. Also I suggest using mpdboot from the first node instead of invoking mpd directly. I've had better success with mpdboot. Regards, Bill ________________________________________ From: beowulf-bounces at beowulf.org [beowulf-bounces at beowulf.org] On Behalf Of Zain elabedin hammade [zenabdin1988 at hotmail.com] Sent: Saturday, November 14, 2009 7:24 AM To: beowulf at beowulf.org Subject: [Beowulf] mpd ..failed ..! Hello All. I have a cluster with 4 machines (fedora core 11). I installed mpich2 - 1.1.1-1.fc11.i586.rpm . I wrote on every machine : mpd & mpdtrace -l then i wrote on thr Master : mpd -h Worker1.cluster.net - p 56128 -n I got : Master.cluster.net_38047 (connect_lhs 944): NOT OK to enter ring; one likely cause: mismatched secretwords Master.cluster.net_38047 (enter_ring 873): lhs connect failed Master.cluster.net_38047 (run 256): failed to enter ring And the same was for other machines : Worker2 and Worker3 . For information : I have SSH works on .. So where is the problem ? What i have to do ? I really need your help . Regarded . ________________________________ Windows Live Hotmail: Your friends can get your Faceb! ook updates, right from Hotmail?. From sabujp at gmail.com Tue Nov 17 11:10:55 2009 From: sabujp at gmail.com (Sabuj Pattanayek) Date: Tue, 17 Nov 2009 13:10:55 -0600 Subject: [Beowulf] How Would You Test Infiniband in New Cluster? In-Reply-To: <4B02EC6D.60107@cse.ucdavis.edu> References: <4B02E74F.8050103@berkeley.edu> <4B02EC6D.60107@cse.ucdavis.edu> Message-ID: Hi, The OMB package for mvapich2 (mpich2 for IB) has some great programs that you can use to test to see if your IB network is working properly. Here are some of my results on our QDR IB network: % mpiexec -n 2 ./osu_latency # OSU MPI Latency Test v3.1.1 # Size Latency (us) 0 1.87 1 1.95 2 1.97 4 1.98 8 1.98 16 1.99 32 2.04 64 2.19 128 3.63 256 3.91 512 4.38 1024 5.25 2048 6.80 4096 8.12 8192 10.94 16384 16.35 32768 22.13 65536 33.28 131072 55.09 262144 100.01 524288 166.54 1048576 333.60 2097152 636.91 4194304 1252.71 % mpiexec -n 2 ./osu_bw # OSU MPI Bandwidth Test v3.1.1 # Size Bandwidth (MB/s) 1 1.44 2 2.79 4 5.51 8 11.18 16 21.17 32 43.42 64 82.41 128 146.97 256 314.42 512 564.18 1024 1033.38 2048 1634.33 4096 2168.96 8192 2514.58 16384 2788.07 32768 3038.48 65536 3213.89 131072 3293.78 262144 3334.07 524288 3353.60 1048576 3355.25 2097152 3362.15 4194304 3365.81 That's 3.3GB/s or ~26.4gbps . HTH, Sabuj Pattanayek On Tue, Nov 17, 2009 at 12:33 PM, Bill Broadley wrote: > Jon Forrest wrote: >> Let's say you have a brand new cluster with >> brand new Infiniband hardware, and that >> you've installed OFED 1.4 and the >> appropriate drivers for your IB >> HCAs (i.e. you see ib0 devices >> on the frontend and all compute nodes). >> The cluster appears to be working >> fine but you're not sure about IB. >> >> How would you test your IB network >> to make sure all is well? 
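For anyone wanting to reproduce numbers like the ones Sabuj posted: the OSU tests are small single-file MPI C programs, so a minimal build-and-run sequence looks roughly like the following. This is only a sketch -- the hostnames are made up, mpiexec/mpirun syntax differs between MPI stacks, and newer OMB releases ship their own Makefile/configure instead of being compiled by hand.

$ mpicc -O2 osu_latency.c -o osu_latency
$ mpicc -O2 osu_bw.c -o osu_bw
$ cat two_hosts        # one entry per node, two different nodes
node01
node02
$ mpiexec -n 2 -machinefile two_hosts ./osu_latency
$ mpiexec -n 2 -machinefile two_hosts ./osu_bw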
From angelv at iac.es Tue Nov 17 01:44:43 2009 From: angelv at iac.es (=?ISO-8859-1?Q?=C1ngel_de_Vicente?=) Date: Tue, 17 Nov 2009 09:44:43 +0000 Subject: [Beowulf] Step by step guide for the installation and configuration of a cluster (with Rocks) to run ParaView Message-ID: <4B02708B.2070908@iac.es> Hi all, we have recently installed a small test cluster to run ParaView visualization software in parallel. The configuration was not trivial, and I put a detailed step-by-step guide in http://www.iac.es/sieinvens/siepedia/pmwiki.php?n=HOWTOs.ParaviewInACluster, in case it can be of interest to someone else. Any comments or suggestions are welcome. Cheers, ?ngel de Vicente -- +---------------------------------------------+ | | | http://www.iac.es/galeria/angelv/ | | | | High Performance Computing Support PostDoc | | Instituto de Astrof?sica de Canarias | | | +---------------------------------------------+ --------------------------------------------------------------------------------------------- ADVERTENCIA: Sobre la privacidad y cumplimiento de la Ley de Protecci?n de Datos, acceda a http://www.iac.es/disclaimer.php WARNING: For more information on privacy and fulfilment of the Law concerning the Protection of Data, consult http://www.iac.es/disclaimer.php?lang=en From bill at cse.ucdavis.edu Tue Nov 17 14:46:43 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Tue, 17 Nov 2009 14:46:43 -0800 Subject: [Beowulf] How Would You Test Infiniband in New Cluster? In-Reply-To: <4B02F263.9030607@berkeley.edu> References: <4B02E74F.8050103@berkeley.edu> <4B02EC6D.60107@cse.ucdavis.edu> <4B02F263.9030607@berkeley.edu> Message-ID: <4B0327D3.2070603@cse.ucdavis.edu> Jon Forrest wrote: > Bill Broadley wrote: > >> My first suggest sanity test would be to test latency and bandwidth to >> insure >> you are getting IB numbers. So 80-100MB/sec and 30-60us for a small >> packet >> would imply GigE. 6-8 times the bandwidth certainly would imply SDR or >> better. Latency varies quite a bit among implementation, I'd try to get >> within 30-40% of advertised latency numbers. > > For those of us who aren't familiar with IB utilities, > could you give some examples of the commands you'd use > to do this? > > Thanks, > Jon Here's 2 that I use: http://cse.ucdavis.edu/bill/relay.c http://cse.ucdavis.edu/bill/mpi_nxnlatbw.c So to compile, assuming a sane environment: mpicc -O3 relay.c -o relay The command to run an MPI program varies by environment and mpi implementation, and batch queue environment (especially tight integration). It should be something close to: mpirun -np -machinefile ./relay 1 mpirun -np -machinefile ./relay 1024 mpirun -np -machinefile ./relay 8192 You should see something like: c0-8 c0-22 size= 1, 16384 hops, 2 nodes in 0.75 sec ( 45.97 us/hop) 85 KB/sec c0-8 c0-22 size= 1024, 16384 hops, 2 nodes in 2.00 sec (121.94 us/hop) 32803 KB/sec c0-8 c0-22 size= 8192, 16384 hops, 2 nodes in 6.21 sec (379.05 us/hop) 84421 KB/sec So basically on a tiny packet 45us of latency (normal for gigE), and on a large package 84MB/sec or so (normal for GigE). I'd start with 2 nodes, then if you are happy try it with all nodes. Now for infiniband you should see something like: c0-5 c0-4 size= 1, 16384 hops, 2 nodes in 0.03 sec ( 1.72 us/hop) 2274 KB/sec c0-5 c0-4 size= 1024, 16384 hops, 2 nodes in 0.16 sec ( 9.92 us/hop) 403324 KB/sec c0-5 c0-4 size= 8192, 16384 hops, 2 nodes in 0.50 sec ( 30.34 us/hop) 1054606 KB/sec Note the latency is some 25 times less and the bandwidth some 10+ times higher. 
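(The placeholders after -np and -machinefile in the mpirun lines above appear to have been eaten by the archive's HTML scrubbing; purely as an illustration, with made-up hostnames, a two-node run would look something like this:)

$ cat machines         # one hostname per line, one entry per node
node01
node02
$ mpirun -np 2 -machinefile machines ./relay 1
$ mpirun -np 2 -machinefile machines ./relay 1024
$ mpirun -np 2 -machinefile machines ./relay 8192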
Note the hostnames are different, don't run multiple copies on the same node unless you intend to. Running 4 copies on a 4 cpu node doesn't test infiniband. So once you get what you expect I'd suggest something a bit more comprehensive. Something like: mpirun -np -machinefile ./mpi_nxnlatbw I'd expect some different in latency and bandwidth between nodes, but not any big differences. Something like: [0<->1] 1.85us 1398.825264 (MillionBytes/sec) [0<->2] 1.75us 1300.812337 (MillionBytes/sec) [0<->3] 1.76us 1396.205242 (MillionBytes/sec) [0<->4] 1.68us 1398.647324 (MillionBytes/sec) [1<->0] 1.82us 1375.550155 (MillionBytes/sec) [1<->2] 1.69us 1397.936020 (MillionBytes/sec) ... Once those numbers are consistent and where you expect them (both latency and bandwidth) I'd follow up with a production code that produces a known answer and is likely to provide much wider MPI coverage. From jlforrest at berkeley.edu Tue Nov 17 16:26:29 2009 From: jlforrest at berkeley.edu (Jon Forrest) Date: Tue, 17 Nov 2009 16:26:29 -0800 Subject: [Beowulf] How Would You Test Infiniband in New Cluster? In-Reply-To: <4B032DA5.2010106@cse.ucdavis.edu> References: <4B02E74F.8050103@berkeley.edu> <4B02EC6D.60107@cse.ucdavis.edu> <4B02F263.9030607@berkeley.edu> <4B032DA5.2010106@cse.ucdavis.edu> Message-ID: <4B033F35.4020106@berkeley.edu> For what it's worth, I'm using 10 nodes, where each node has 12 cores. I'm also using Rocks with the Mellonox roll. My HCA is a Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev 20) > mpirun -np -machinefile ./relay 1 > mpirun -np -machinefile ./relay 1024 > mpirun -np -machinefile ./relay 8192 > > You should see something like: > c0-8 c0-22 > size= 1, 16384 hops, 2 nodes in 0.75 sec ( 45.97 us/hop) 85 KB/sec > c0-8 c0-22 > size= 1024, 16384 hops, 2 nodes in 2.00 sec (121.94 us/hop) 32803 KB/sec > c0-8 c0-22 > size= 8192, 16384 hops, 2 nodes in 6.21 sec (379.05 us/hop) 84421 KB/sec > > So basically on a tiny packet 45us of latency (normal for gigE), and on a > large package 84MB/sec or so (normal for GigE). > > I'd start with 2 nodes, then if you are happy try it with all nodes. Since there are 10 nodes, I did the following, with the results shown (I removed the node names): $ mpirun -np 10 -machinefile hosts ./relay 1 size= 1, 16384 hops, 10 nodes in 0.20 sec ( 12.44 us/hop) 314 KB/sec $ mpirun -np 10 -machinefile hosts ./relay 1024 size= 1024, 16384 hops, 10 nodes in 0.33 sec ( 20.40 us/hop) 196074 KB/sec $ mpirun -np 10 -machinefile hosts ./relay 8192 size= 8192, 16384 hops, 10 nodes in 0.97 sec ( 59.51 us/hop) 537734 KB/sec I believe these are with IB. > So once you get what you expect I'd suggest something a bit more > comprehensive. Something like: > mpirun -np -machinefile ./mpi_nxnlatbw > > I'd expect some different in latency and bandwidth between nodes, but not any > big differences. Something like: > [0<->1] 1.85us 1398.825264 (MillionBytes/sec) I did the following, with the results shown: $ mpirun -np 2 -machinefile hosts ./mpi_nxnlatbw [0<->1] 3.67us 1289.409397 (MillionBytes/sec) [1<->0] 3.67us 1276.377689 (MillionBytes/sec) I also ran this with more nodes but the point-to-point times were about the same. Does this look right? Based on your numbers, it looks like my IB is slower than yours. Because of the strange way the OFED was installed, I can't easily run over just ethernet. 
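(One rough way to settle whether a run like this really went over IB, assuming the stock OFED diagnostics are installed, is to snapshot the HCA port counters around a test. Counter names and units differ a bit between OFED releases, so treat this as a sketch.)

$ perfquery            # note the port's transmit/receive data counters
$ mpirun -np 10 -machinefile hosts ./relay 8192
$ perfquery            # the data counters should jump by roughly the volume the test moved;
                       # if they stay flat, the traffic went over ethernet instead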
Thanks for your help -- Jon Forrest Research Computing Support College of Chemistry 173 Tan Hall University of California Berkeley Berkeley, CA 94720-1460 510-643-1032 jlforrest at berkeley.edu From jlforrest at berkeley.edu Tue Nov 17 17:01:12 2009 From: jlforrest at berkeley.edu (Jon Forrest) Date: Tue, 17 Nov 2009 17:01:12 -0800 Subject: [Beowulf] How Would You Test Infiniband in New Cluster? In-Reply-To: <4B032DA5.2010106@cse.ucdavis.edu> References: <4B02E74F.8050103@berkeley.edu> <4B02EC6D.60107@cse.ucdavis.edu> <4B02F263.9030607@berkeley.edu> <4B032DA5.2010106@cse.ucdavis.edu> Message-ID: <4B034758.6040401@berkeley.edu> I had said "I believe these are with IB." Now I'm not so sure. I just did a "ifconfig ib0" on all the nodes and they all say BROADCAST MULTICAST MTU:65520 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:256 RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) So, it doesn't look like any of these tests used IB, although I'm not sure because some of those numbers looked better than I'd expect for just 1Gb ethernet. I'll have to figure out how to force IB when using OpenMPI. -- Jon Forrest Research Computing Support College of Chemistry 173 Tan Hall University of California Berkeley Berkeley, CA 94720-1460 510-643-1032 jlforrest at berkeley.edu From tom.elken at qlogic.com Tue Nov 17 17:35:48 2009 From: tom.elken at qlogic.com (Tom Elken) Date: Tue, 17 Nov 2009 17:35:48 -0800 Subject: [Beowulf] How Would You Test Infiniband in New Cluster? In-Reply-To: <4B033F35.4020106@berkeley.edu> References: <4B02E74F.8050103@berkeley.edu> <4B02EC6D.60107@cse.ucdavis.edu> <4B02F263.9030607@berkeley.edu> <4B032DA5.2010106@cse.ucdavis.edu> <4B033F35.4020106@berkeley.edu> Message-ID: <35AAF1E4A771E142979F27B51793A4888702F998AE@AVEXMB1.qlogic.org> > On Behalf Of Jon Forrest > My HCA is a Mellanox Technologies MT25204 [InfiniHost III Lx HCA] > (rev 20) > > > I did the following, with the results shown: > > $ mpirun -np 2 -machinefile hosts ./mpi_nxnlatbw > [0<->1] 3.67us 1289.409397 (MillionBytes/sec) > [1<->0] 3.67us 1276.377689 (MillionBytes/sec) > > I also ran this with more nodes but the point-to-point > times were about the same. > > Does this look right? For InfiniHost III, these numbers look right, and you are using IB. You may get somewhat higher bandwidth using OSU MPI Benchmarks or Intel MPI Benchmarks (formerly Pallas) because a fairly modest message size is used by mpi_nxnlatbw's bandwidth test. It is written to get somewhat close to peak bandwidth and best latency and run over a fairly large cluster in a reasonable amount of time. But as a result, the bandwidth test runs so quickly that taking an OS interrupt can skew a few of the results. Before concluding that a link is underperforming based on mpi_nxnlatbw, re-run the test to see if the same link is slow, or use another more comprehensive benchmark like OMB or IMB. -Tom > Based on your numbers, it looks like my > IB is slower than yours. Because of the strange way the OFED > was installed, I can't easily run over just ethernet. 
> > Thanks for your help > > > -- > Jon Forrest > Research Computing Support > College of Chemistry > 173 Tan Hall > University of California Berkeley > Berkeley, CA > 94720-1460 > 510-643-1032 > jlforrest at berkeley.edu > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From worringen at googlemail.com Tue Nov 17 12:14:28 2009 From: worringen at googlemail.com (Joachim Worringen) Date: Tue, 17 Nov 2009 12:14:28 -0800 Subject: [Beowulf] Final Announcement: 11th Annual Beowulf Bash 9pm Nov 16 2009 In-Reply-To: References: Message-ID: <981e81f00911171214h68a899f0me2fbb11124dc90a5@mail.gmail.com> On Mon, Nov 16, 2009 at 1:40 AM, Donald Becker wrote: > > > Final Announcement: 11th Annual Beowulf Bash 9pm Nov 16 2009 > > > 11th Annual Beowulf Bash > And > LECCIBG > > Thanks for this great event - Norman Sylvester rocks! Joachim -------------- next part -------------- An HTML attachment was scrubbed... URL: From siegert at sfu.ca Tue Nov 17 18:30:00 2009 From: siegert at sfu.ca (Martin Siegert) Date: Tue, 17 Nov 2009 18:30:00 -0800 Subject: [Beowulf] How Would You Test Infiniband in New Cluster? In-Reply-To: <4B034758.6040401@berkeley.edu> References: <4B02E74F.8050103@berkeley.edu> <4B02EC6D.60107@cse.ucdavis.edu> <4B02F263.9030607@berkeley.edu> <4B032DA5.2010106@cse.ucdavis.edu> <4B034758.6040401@berkeley.edu> Message-ID: <20091118023000.GB453@stikine.its.sfu.ca> On Tue, Nov 17, 2009 at 05:01:12PM -0800, Jon Forrest wrote: > I had said "I believe these are with IB." > Now I'm not so sure. I just did a > > "ifconfig ib0" > > on all the nodes and they all say > > BROADCAST MULTICAST MTU:65520 Metric:1 > RX packets:0 errors:0 dropped:0 overruns:0 frame:0 > TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:256 > RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) AFAIK, ifconfig ib0 will show you the ipoib numbers. Since MPI (hopefully) is not using this, you see zeros. > So, it doesn't look like any of these tests used IB, > although I'm not sure because some of those numbers > looked better than I'd expect for just 1Gb ethernet. > > I'll have to figure out how to force IB when > using OpenMPI. Edit your ~/.openmpi/mca-params.conf file and add the line btl = ^tcp That will explicitly prevent openmpi using tcp (it would use ib before tcp by default, but this way it will fail if ib is not working). > -- > Jon Forrest > Research Computing Support > College of Chemistry > 173 Tan Hall > University of California Berkeley > Berkeley, CA > 94720-1460 > 510-643-1032 > jlforrest at berkeley.edu > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf Cheers, Martin -- Martin Siegert Head, Research Computing WestGrid Site Lead IT Services phone: 778 782-4691 Simon Fraser University fax: 778 782-4242 Burnaby, British Columbia email: siegert at sfu.ca Canada V5A 1S6 From gus at ldeo.columbia.edu Tue Nov 17 19:09:02 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Tue, 17 Nov 2009 22:09:02 -0500 Subject: [Beowulf] How Would You Test Infiniband in New Cluster? 
In-Reply-To: <20091118023000.GB453@stikine.its.sfu.ca> References: <4B02E74F.8050103@berkeley.edu> <4B02EC6D.60107@cse.ucdavis.edu> <4B02F263.9030607@berkeley.edu> <4B032DA5.2010106@cse.ucdavis.edu> <4B034758.6040401@berkeley.edu> <20091118023000.GB453@stikine.its.sfu.ca> Message-ID: <4B03654E.4010002@ldeo.columbia.edu> Martin Siegert wrote: > On Tue, Nov 17, 2009 at 05:01:12PM -0800, Jon Forrest wrote: >> I had said "I believe these are with IB." >> Now I'm not so sure. I just did a >> >> "ifconfig ib0" >> >> on all the nodes and they all say >> >> BROADCAST MULTICAST MTU:65520 Metric:1 >> RX packets:0 errors:0 dropped:0 overruns:0 frame:0 >> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 >> collisions:0 txqueuelen:256 >> RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) > > AFAIK, ifconfig ib0 will show you the ipoib numbers. Since MPI > (hopefully) is not using this, you see zeros. > >> So, it doesn't look like any of these tests used IB, >> although I'm not sure because some of those numbers >> looked better than I'd expect for just 1Gb ethernet. >> >> I'll have to figure out how to force IB when >> using OpenMPI. > > Edit your ~/.openmpi/mca-params.conf file and add the line > > btl = ^tcp > > That will explicitly prevent openmpi using tcp (it would use ib before > tcp by default, but this way it will fail if ib is not working). > Hi Jon Martin's suggestion is the the best, particularly if you plan to always use IB, never use TCP. Alternatively you could include these mca parameters on the mpiexec command line to select IB: -mca btl openib,sm,self OpenMPI has several mechanisms to make these choices. See these FAQ: http://www.open-mpi.org/faq/?category=sysadmin#sysadmin-mca-params http://www.open-mpi.org/faq/?category=tuning#setting-mca-params My $0.02 Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- >> -- >> Jon Forrest >> Research Computing Support >> College of Chemistry >> 173 Tan Hall >> University of California Berkeley >> Berkeley, CA >> 94720-1460 >> 510-643-1032 >> jlforrest at berkeley.edu >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > Cheers, > Martin > From bill at cse.ucdavis.edu Tue Nov 17 19:18:58 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Tue, 17 Nov 2009 19:18:58 -0800 Subject: [Beowulf] How Would You Test Infiniband in New Cluster? In-Reply-To: <4B034758.6040401@berkeley.edu> References: <4B02E74F.8050103@berkeley.edu> <4B02EC6D.60107@cse.ucdavis.edu> <4B02F263.9030607@berkeley.edu> <4B032DA5.2010106@cse.ucdavis.edu> <4B034758.6040401@berkeley.edu> Message-ID: <4B0367A2.4080803@cse.ucdavis.edu> Jon Forrest wrote: > I had said "I believe these are with IB." > Now I'm not so sure. I just did a The performance numbers you showed from relay and mpi_nxnlatbw are definitely much faster than GigE. Unless it's multiple copies running on a single machine (thus printing the hostname). Assuming that it was actually using the interconnect (not multiple copies running on a single machine) > > "ifconfig ib0" I suspect this is for TCPIP over ib, and doesn't show MPI traffic. You didn't mention which controllers do you have? 
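(Collecting Martin's and Gus's suggestions from above in one place -- a sketch only, with the btl list taken straight from their posts:)

# per-user config that refuses to fall back to tcp
$ cat ~/.openmpi/mca-params.conf
btl = ^tcp
# or select the transports explicitly for a single run
$ mpirun -np 10 -machinefile hosts -mca btl openib,sm,self ./relay 8192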
I suspect that there is a tool to show the various counters on the HCA, let alone on the switch side. > I'll have to figure out how to force IB when > using OpenMPI. Looks like IB to me, might want to do the reverse to see the real differential, I find it very handy for cost justifying IB on future clusters based on real application performance. From deadline at eadline.org Tue Nov 17 21:38:53 2009 From: deadline at eadline.org (Douglas Eadline) Date: Wed, 18 Nov 2009 00:38:53 -0500 (EST) Subject: [Beowulf] The Limulus Case Message-ID: <51306.140.221.229.202.1258522733.squirrel@mail.eadline.org> If you are at SC09 stop by the SICORP booth (1209) (I managed to wrangle a pedestal) to see the the Limulus case - four microATX motherboards in one case. I'll be around at times to answer questions. Jess Cannata is also helping out. If you are not at the show or want to see what I'm talking about, you can see some pictures here: http://limulus.basement-supercomputing.com/wiki/LimulusCase BTW the Beobash was huge success, I think we had over 450 people. Pictures are up at InsideHPC http://insidehpc.com/2009/11/17/beowulf-bash-2009-success/ -- Doug From prentice at ias.edu Tue Nov 17 21:11:41 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Tue, 17 Nov 2009 21:11:41 -0800 Subject: [Beowulf] How Would You Test Infiniband in New Cluster? In-Reply-To: <4B02E74F.8050103@berkeley.edu> References: <4B02E74F.8050103@berkeley.edu> Message-ID: <4B03820D.8090305@ias.edu> Jon Forrest wrote: > Let's say you have a brand new cluster with > brand new Infiniband hardware, and that > you've installed OFED 1.4 and the > appropriate drivers for your IB > HCAs (i.e. you see ib0 devices > on the frontend and all compute nodes). > The cluster appears to be working > fine but you're not sure about IB. > > How would you test your IB network > to make sure all is well? > > Cordially, I would start with the basic IB diagnostic utilities. On RHEL-based systems, they are in the infiniband-diags rpm. I have limited experience of them myself, but you can check the man pages. They may not give you performance metrics, but can definitely help you determine if everything is connected and working properly. 
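For instance, a first pass with those tools might look something like this (run from any node with the IB stack loaded; exact output varies between OFED releases):

$ ibstat           # local HCA: port State should be Active, Rate 10/20/40 for SDR/DDR/QDR
$ ibhosts          # every compute node's HCA should appear here
$ ibchecknet       # walks the fabric and flags bad ports/links
$ ibcheckerrors    # summarizes error counters across the fabric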
Here's a list of the commands available from this package in my RHEL 5.3 rebuild: $ rpm -ql infiniband-diags | grep bin /usr/sbin/check_lft_balance.pl /usr/sbin/dump_lfts.sh /usr/sbin/dump_mfts.sh /usr/sbin/ibaddr /usr/sbin/ibcheckerrors /usr/sbin/ibcheckerrs /usr/sbin/ibchecknet /usr/sbin/ibchecknode /usr/sbin/ibcheckport /usr/sbin/ibcheckportstate /usr/sbin/ibcheckportwidth /usr/sbin/ibcheckstate /usr/sbin/ibcheckwidth /usr/sbin/ibclearcounters /usr/sbin/ibclearerrors /usr/sbin/ibdatacounters /usr/sbin/ibdatacounts /usr/sbin/ibdiscover.pl /usr/sbin/ibfindnodesusing.pl /usr/sbin/ibhosts /usr/sbin/ibidsverify.pl /usr/sbin/iblinkinfo.pl /usr/sbin/ibnetdiscover /usr/sbin/ibnodes /usr/sbin/ibping /usr/sbin/ibportstate /usr/sbin/ibprintca.pl /usr/sbin/ibprintrt.pl /usr/sbin/ibprintswitch.pl /usr/sbin/ibqueryerrors.pl /usr/sbin/ibroute /usr/sbin/ibrouters /usr/sbin/ibstat /usr/sbin/ibstatus /usr/sbin/ibswitches /usr/sbin/ibswportwatch.pl /usr/sbin/ibsysstat /usr/sbin/ibtracert /usr/sbin/perfquery /usr/sbin/saquery /usr/sbin/set_nodedesc.sh /usr/sbin/sminfo /usr/sbin/smpdump /usr/sbin/smpquery /usr/sbin/vendstat From prentice at ias.edu Tue Nov 17 21:21:27 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Tue, 17 Nov 2009 21:21:27 -0800 Subject: [Beowulf] The Limulus Case In-Reply-To: <51306.140.221.229.202.1258522733.squirrel@mail.eadline.org> References: <51306.140.221.229.202.1258522733.squirrel@mail.eadline.org> Message-ID: <4B038457.4060303@ias.edu> Douglas Eadline wrote: > If you are at SC09 stop by the SICORP booth (1209) (I managed to > wrangle a pedestal) to see the the Limulus case - four microATX motherboards > in one case. I'll be around at times to answer questions. > Jess Cannata is also helping out. > > If you are not at the show or want to see what I'm talking > about, you can see some pictures here: > > http://limulus.basement-supercomputing.com/wiki/LimulusCase > Doug, I was wandering around looking for this today. I wish I had my laptop with me to get this information then. > BTW the Beobash was huge success, I think we had over 450 > people. Pictures are up at InsideHPC > > http://insidehpc.com/2009/11/17/beowulf-bash-2009-success/ > If Walt wants some accompaniment next year, I can bring my bass and Bill Wichser from Princeton said he'd bring his harmonica. At least he said he would last night. He might not remember this morning. Prentice From deadline at eadline.org Wed Nov 18 07:50:48 2009 From: deadline at eadline.org (Douglas Eadline) Date: Wed, 18 Nov 2009 10:50:48 -0500 (EST) Subject: [Beowulf] The Limulus Case In-Reply-To: <4B038457.4060303@ias.edu> References: <51306.140.221.229.202.1258522733.squirrel@mail.eadline.org> <4B038457.4060303@ias.edu> Message-ID: <35706.173.8.196.93.1258559448.squirrel@mail.eadline.org> Look for the Appro booth, it is right next to it. On Thursday I'll have more time, I can open the case and play with it some more. Today I'm continuing my totally un-professional video interviews for Linux magazine. You know an "HPC gone Wild" kind of thing. If you see me with the camera, say hi share your thoughts, or what ever. -- Doug > Douglas Eadline wrote: >> If you are at SC09 stop by the SICORP booth (1209) (I managed to >> wrangle a pedestal) to see the the Limulus case - four microATX >> motherboards >> in one case. I'll be around at times to answer questions. >> Jess Cannata is also helping out. 
>> >> If you are not at the show or want to see what I'm talking >> about, you can see some pictures here: >> >> http://limulus.basement-supercomputing.com/wiki/LimulusCase >> > Doug, > > I was wandering around looking for this today. I wish I had my laptop > with me to get this information then. >> BTW the Beobash was huge success, I think we had over 450 >> people. Pictures are up at InsideHPC >> >> http://insidehpc.com/2009/11/17/beowulf-bash-2009-success/ >> > If Walt wants some accompaniment next year, I can bring my bass and Bill > Wichser from Princeton said he'd bring his harmonica. At least he said > he would last night. He might not remember this morning. > > Prentice > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Doug From prentice at ias.edu Wed Nov 18 08:00:46 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Wed, 18 Nov 2009 08:00:46 -0800 Subject: [Beowulf] The Limulus Case In-Reply-To: <35706.173.8.196.93.1258559448.squirrel@mail.eadline.org> References: <51306.140.221.229.202.1258522733.squirrel@mail.eadline.org> <4B038457.4060303@ias.edu> <35706.173.8.196.93.1258559448.squirrel@mail.eadline.org> Message-ID: <4B041A2E.5030801@ias.edu> Douglas Eadline wrote: > Look for the Appro booth, it is right next to it. > On Thursday I'll have more time, I can open the case > and play with it some more. > > Today I'm continuing my totally un-professional > video interviews for Linux magazine. You know > an "HPC gone Wild" kind of thing. > > If you see me with the camera, say hi share your thoughts, > or what ever. > > -- > Doug > > Can I lift up my shirt and flash the audience? Prentice From jbardin at bu.edu Thu Nov 19 10:18:41 2009 From: jbardin at bu.edu (james bardin) Date: Thu, 19 Nov 2009 13:18:41 -0500 Subject: [Beowulf] Large raid rebuild times Message-ID: Hello, Has anyone here seen any numbers, or tested themselves, the rebuild times for large raid arrays (raid 6 specifically)? I can't seem to find anything concrete to go by, and I haven't had to rebuild anything larger than a few TB. What happens when you loose a 2TB drive in a 20TB array for instance? Do any hardware raid solutions help. I don't think ZFS is an option right now, so I'm looking at Linux and/or hardware raid. Thanks -jim From vanallsburg at hope.edu Thu Nov 19 11:18:49 2009 From: vanallsburg at hope.edu (Paul Van Allsburg) Date: Thu, 19 Nov 2009 14:18:49 -0500 Subject: [Beowulf] Large raid rebuild times In-Reply-To: References: Message-ID: <4B059A19.909@hope.edu> james bardin wrote: > Hello, > > Has anyone here seen any numbers, or tested themselves, the rebuild > times for large raid arrays (raid 6 specifically)? > I can't seem to find anything concrete to go by, and I haven't had to > rebuild anything larger than a few TB. What happens when you loose a > 2TB drive in a 20TB array for instance? Do any hardware raid solutions > help. I don't think ZFS is an option right now, so I'm looking at > Linux and/or hardware raid. > > > Thanks > -jim > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > Jim, I have a new Infortrend a24s-g2130 with 24 x 1T drives. 
I setup a raid6 with 2 spares and during the raid initialization a drive failed. The raid6 initialization took 20 hours and the rebuild took 16 hours. I've since changed the rebuild priority from normal to high but have not had any more drives fail. Cheers, Paul http://www.infortrend.com/main/2_product/es_a24s-g2130.asp -- Paul Van Allsburg Scientific Computing Specialist Natural Sciences Division, Hope College 35 E. 12th St. Holland, Michigan 49423 616-395-7292 vanallsburg at hope.edu http://www.hope.edu/academic/csm/
From rsandilands at authentium.com Thu Nov 19 12:29:12 2009 From: rsandilands at authentium.com (Robert Sandilands) Date: Thu, 19 Nov 2009 15:29:12 -0500 Subject: [Beowulf] Re: Large raid rebuild times In-Reply-To: <200911192000.nAJK077q009800@bluewest.scyld.com> References: <200911192000.nAJK077q009800@bluewest.scyld.com> Message-ID: <4B05AA98.4020401@authentium.com> I am currently rebuilding a 28 x 1 TB RAID 5 volume that is based on the SurfRAID TRITON 16S3 with a JBOD unit. It is 16.6% complete after 2 1/2 hours. I am rebuilding before copying the data and returning all the Seagate 1 TB drives. Replacing it with Hitachi 2 TB drives and RAID 6. The combination of a large number of drives in RAID 5 and/or Seagate 1 TB drives is not to be recommended. Robert beowulf-request at beowulf.org wrote: > > Hello, > > Has anyone here seen any numbers, or tested themselves, the rebuild > times for large raid arrays (raid 6 specifically)? > I can't seem to find anything concrete to go by, and I haven't had to > rebuild anything larger than a few TB. What happens when you loose a > 2TB drive in a 20TB array for instance? Do any hardware raid solutions > help. I don't think ZFS is an option right now, so I'm looking at > Linux and/or hardware raid. > > >
From dimitrios.v.gerasimatos at jpl.nasa.gov Thu Nov 19 20:37:17 2009 From: dimitrios.v.gerasimatos at jpl.nasa.gov (Gerasimatos, Dimitrios V (343K)) Date: Thu, 19 Nov 2009 20:37:17 -0800 Subject: [Beowulf] Re: Large raid rebuild times In-Reply-To: <200911192000.nAJK077r009800@bluewest.scyld.com> References: <200911192000.nAJK077r009800@bluewest.scyld.com> Message-ID: <6F127CF61C0FE143B5E46BF06036094F952A7CC595@ALTPHYEMBEVSP30.RES.AD.JPL> If rebuild times are a problem then I recommend you go with ZFS or a Netapp. High-performance Netapp can be expensive. ZFS doesn't have to be. A fsck of a large filesystem can easily take > 24 hours. Dimitri -- Dimitrios Gerasimatos dimitrios.gerasimatos at jpl.nasa.gov Section 343 Jet Propulsion Laboratory 4800 Oak Grove Dr. Mail Stop 264-820 Pasadena, CA 91109 Voice: 818.354.4910 FAX: 818.393.7413 Cell: 818.726.8617
From eagles051387 at gmail.com Fri Nov 20 02:21:26 2009 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Fri, 20 Nov 2009 10:21:26 +0000 Subject: [Beowulf] Re: Large raid rebuild times In-Reply-To: <6F127CF61C0FE143B5E46BF06036094F952A7CC595@ALTPHYEMBEVSP30.RES.AD.JPL> References: <200911192000.nAJK077r009800@bluewest.scyld.com> <6F127CF61C0FE143B5E46BF06036094F952A7CC595@ALTPHYEMBEVSP30.RES.AD.JPL> Message-ID: wouldnt the limiting factor be the read and write times of the drives in regards to rebuilding an array? -------------- next part -------------- An HTML attachment was scrubbed...
URL: From pal at di.fct.unl.pt Fri Nov 20 04:13:12 2009 From: pal at di.fct.unl.pt (Paulo Afonso Lopes) Date: Fri, 20 Nov 2009 12:13:12 -0000 (WET) Subject: [Beowulf] Re: Large raid rebuild times Message-ID: <21910.89.180.105.138.1258719192.squirrel@webmail.fct.unl.pt> (Sorry: I forgot to cc the mailing list) > wouldnt the limiting factor be the read and write times of the drives in regards to rebuilding an array? I would expect the write BW of a single drive (the one being rebuilt) in sequential access mode to be the limiting factor. If, in a 5 drive RAID-5, a single drive sustains, say, 50 MB/s write BW (and, say, 50MB/s read for simplicity): - time for reading 4 drives in parallel == time for reading 1 drive (e.g., 1TB at 50MB/s == 5,5 hours) If the data recovery compute time is negligible (I expect it to be), then the time to rebuild a 5 disk RAID5 with 1 TB disks should be better than 6h, provided that the "host" is able to sustain more than 250 MB/s for the disks. If you are using a disk-array (not an entry-level cheapo, but FC ones like IBM DS4800, HP EVA 6000, EMC CX500 or better - these are old models) with no activity other than the rebuilding, it should take less than 6h for a 5x 1TB/drive RAID5, IMO. Regards, -- Paulo Afonso Lopes | Tel: +351- 21 294 8536 Departamento de Inform?tica | 294 8300 ext.10702 Faculdade de Ci?ncias e Tecnologia | Fax: +351- 21 294 8541 Universidade Nova de Lisboa | e-mail: poral at fct.unl.pt 2829-516 Caparica, PORTUGAL From landman at scalableinformatics.com Fri Nov 20 06:32:08 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Fri, 20 Nov 2009 09:32:08 -0500 Subject: [Beowulf] Re: Large raid rebuild times In-Reply-To: <4B05AA98.4020401@authentium.com> References: <200911192000.nAJK077q009800@bluewest.scyld.com> <4B05AA98.4020401@authentium.com> Message-ID: <4B06A868.5050000@scalableinformatics.com> Robert Sandilands wrote: > I am currently rebuilding a 28 x 1 TB RAID 5 volume that is based on the Hmmm.... > SurfRAID TRITON 16S3 with a JBOD unit. It is 16.6% complete after 2 1/2 > hours. > > I am rebuilding before copying the data and returning all the Seagate 1 > TB drives. Replacing it with Hitachi 2 TB drives and RAID 6. > > The combination of a large number of drives in RAID 5 and/or Seagate 1 > TB drives is not to be recommended. Actually, you shouldn't be using RAID5 any more. The bit error rates suggest that you will in all likelihood, hit an uncorrectable error within a very small number of rebuilds, and lose data. This is fairly well known at this point, so I have to admit surprise to hear of anyone using 20+ drives of TB size in a RAID5. On Seagate, YMMV, they have been rock solid for us and our customers (thousands shipped, scales of PBs of storage). Failure rates somewhat above their statistical estimates, very much in line with what Google indicates they have observed with their drives. We haven't used Hitachi very much in our units, so I can't comment much on their quality/failure rate. Joe -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. 
email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From landman at scalableinformatics.com Fri Nov 20 06:43:48 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Fri, 20 Nov 2009 09:43:48 -0500 Subject: [Beowulf] Large raid rebuild times In-Reply-To: References: Message-ID: <4B06AB24.7020302@scalableinformatics.com> james bardin wrote: > Hello, > > Has anyone here seen any numbers, or tested themselves, the rebuild Hi James, Yes, we test this (by purposely failing a drive on the storage we ship to our customers). > times for large raid arrays (raid 6 specifically)? Yes, RAID6 with up to 24 drives per RAID, up to 2TB drives. Due to the bit error rate failure models for the uncorrectable errors on disks, RAID5 is *strongly* contra-indicated for storage of more than a few small TB, and more than a few drives (less then 5 and less than 1TB). The risk of a second failure during rebuild is simply unacceptably high, which would/does permanently take out your data in RAID5. > I can't seem to find anything concrete to go by, and I haven't had to > rebuild anything larger than a few TB. What happens when you loose a > 2TB drive in a 20TB array for instance? Do any hardware raid solutions > help. I don't think ZFS is an option right now, so I'm looking at We have customers with 32TB raw per RAID, and when a drive fails, it rebuilds. Rebuild time is a function of how fast the card is set up to do rebuilds, you can tune the better cards in terms of "background" rebuild performance. For low rebuild speeds, we have seen 24 hours+, for high rebuild speeds, we have seen 12-15 hours for the 32TB. ZFS is probably not what you want to do ... building a critical dependency upon a product that has a somewhat uncertain future ... Joe -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From jbardin at bu.edu Fri Nov 20 08:56:18 2009 From: jbardin at bu.edu (james bardin) Date: Fri, 20 Nov 2009 11:56:18 -0500 Subject: [Beowulf] Large raid rebuild times In-Reply-To: <4B06AB24.7020302@scalableinformatics.com> References: <4B06AB24.7020302@scalableinformatics.com> Message-ID: Hi Joe, On Fri, Nov 20, 2009 at 9:43 AM, Joe Landman wrote: > > ?We have customers with 32TB raw per RAID, and when a drive fails, it > rebuilds. ?Rebuild time is a function of how fast the card is set up to do > rebuilds, you can tune the better cards in terms of "background" rebuild > performance. ?For low rebuild speeds, we have seen 24 hours+, for high > rebuild speeds, we have seen 12-15 hours for the 32TB. > Thanks for that. That sounds inline with my expectations. Any chance you've compared linux md raid6 to hardware solutions for your devices? > ?ZFS is probably not what you want to do ... building a critical dependency > upon a product that has a somewhat uncertain future ... > I'm not too worried about zfs - it has plenty of following. I'm personally waiting for btrfs to stabilize so we can start testing, but that's a ways off. The issue here is that the group setting up these storage servers is an all linux shop, and they don't want the overhead of another OS. 
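(For the md side of that comparison, a minimal sketch of forcing and then watching a software RAID6 rebuild -- the device names and member count here are made up:)

# build a 10-drive RAID6 from placeholder devices
$ mdadm --create /dev/md0 --level=6 --raid-devices=10 /dev/sd[b-k]
# force a rebuild by failing, removing and re-adding one member
$ mdadm /dev/md0 --fail /dev/sdk --remove /dev/sdk
$ mdadm /dev/md0 --add /dev/sdk
# progress and an estimated finish time show up here
$ cat /proc/mdstat
# md throttles rebuild speed between these limits (KB/s per device)
$ cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max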
Thanks -jim From hahn at mcmaster.ca Fri Nov 20 13:10:23 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 20 Nov 2009 16:10:23 -0500 (EST) Subject: [Beowulf] Re: Large raid rebuild times In-Reply-To: References: <200911192000.nAJK077r009800@bluewest.scyld.com> <6F127CF61C0FE143B5E46BF06036094F952A7CC595@ALTPHYEMBEVSP30.RES.AD.JPL> Message-ID: > wouldnt the limiting factor be the read and write times of the drives in > regards to rebuilding an array? assuming your controller/xor engine/bus/etc are not a bottleneck, rebuild takes 2 * singleDiskSize / singleDiskSpeed. for a current-gen 2TB disk at 100 MB/s, that's only 6 hours. (in particular, rebuild speed need not be proportional to the total volume size, since all data blocks can be read concurrently.) since people are talking about much longer times than that, they must have bottlenecks (or slow disks). online reconstruction (where the volume is available for use while rebuilding) normally causes a significant slowdown as well. From amjad11 at gmail.com Sat Nov 21 17:51:26 2009 From: amjad11 at gmail.com (amjad ali) Date: Sat, 21 Nov 2009 20:51:26 -0500 Subject: [Beowulf] Performance profiling/tuning on different systems Message-ID: <428810f20911211751r37749319v9cd9e1cd01d021d5@mail.gmail.com> Hi all, Suppose a code is tuned on a specific system (e.g Intel Xeon based using Vtune or Trace Collector). Then to how much extent this tuning will be useful if this code is compiled and run on some other system (e.g. AMD Opteron based)? Means whether the code tuning performed at one system (possibly with a Profiler specific for that system) is almost equivalently good on another system ? Or we need to tune it again (possibly using a Profiler specific to the new system). Thank you for your attention. A.Ali -------------- next part -------------- An HTML attachment was scrubbed... URL: From thakur at mcs.anl.gov Sun Nov 22 15:55:58 2009 From: thakur at mcs.anl.gov (Rajeev Thakur) Date: Sun, 22 Nov 2009 17:55:58 -0600 Subject: [Beowulf] FW: MPI Forum community feedback survey Message-ID: -----Original Message----- From: Jeff Squyres Sent: Friday, November 20, 2009 4:01 PM Subject: MPI Forum community feedback survey The MPI Forum announced at its SC09 BOF that they are soliciting community feedback to help guide the MPI-3 standards process. A survey is available online at the following URL: http://mpi-forum.questionpro.com/ Password: mpi3 In this survey, the MPI Forum is asking as many people as possible for feedback on the MPI-3 process -- what features to include, what features to not include, etc. We encourage you to forward this survey on to as many interested and relevant parties as possible. It will take approximately 10 minutes to complete the questionnaire. No question in the survey is mandatory; feel free to only answer the questions which are relevant to you and your applications. Your answers will help the MPI Forum guide its process to create a genuinely useful MPI-3 standard. This survey closes December 31, 2009. Your survey responses will be strictly confidential and data from this research will be reported only in the aggregate. Your information will be coded and will remain confidential. If you have questions at any time about the survey or the procedures, you may contact the MPI Forum via email to mpi-comments at mpi-forum.org. Thank you very much for your time and support. 
-- Jeff Squyres From csamuel at vpac.org Sun Nov 22 19:57:44 2009 From: csamuel at vpac.org (Chris Samuel) Date: Mon, 23 Nov 2009 14:57:44 +1100 (EST) Subject: [Beowulf] Performance profiling/tuning on different systems In-Reply-To: <428810f20911211751r37749319v9cd9e1cd01d021d5@mail.gmail.com> Message-ID: <26188990.651258948662183.JavaMail.csamuel@sys26> ----- "amjad ali" wrote: > Hi all, Hiya, > Or we need to tune it again (possibly using a Profiler > specific to the new system). The only way to know for certain is to test it and see. Best of luck! Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From brockp at umich.edu Mon Nov 23 07:13:57 2009 From: brockp at umich.edu (Brock Palen) Date: Mon, 23 Nov 2009 10:13:57 -0500 Subject: [Beowulf] SC09 podcast for those who missed the show Message-ID: <089B1C9F-F3A7-467D-A288-1EFFBE647571@umich.edu> For those of you who did not make it to sc09, Jeff Squyres and Brock Palen (that's me), did a special version of our podcast (rce- cast.com) on some of the things we took away from the show. Show notes and mp3 download http://www.rce-cast.com/index.php/Podcast/rce-21-sc09-supercomputing-09.html iTunes Subscribe: http://itunes.apple.com/WebObjects/MZStore.woa/wa/viewPodcast?id=302882307 RSS Feed: http://www.rce-cast.com/index.php/component/option,com_bca-rss-syndicator/feed_id,1/ Feel free to contact me off list with show ideas! Brock Palen www.umich.edu/~brockp Center for Advanced Computing brockp at umich.edu (734)936-1985 From thakur at mcs.anl.gov Sun Nov 22 16:06:54 2009 From: thakur at mcs.anl.gov (Rajeev Thakur) Date: Sun, 22 Nov 2009 18:06:54 -0600 Subject: [Beowulf] [hpc-announce] FW: MPI Forum community feedback survey Message-ID: -----Original Message----- From: Jeff Squyres Sent: Friday, November 20, 2009 4:01 PM Subject: MPI Forum community feedback survey The MPI Forum announced at its SC09 BOF that they are soliciting community feedback to help guide the MPI-3 standards process. A survey is available online at the following URL: http://mpi-forum.questionpro.com/ Password: mpi3 In this survey, the MPI Forum is asking as many people as possible for feedback on the MPI-3 process -- what features to include, what features to not include, etc. We encourage you to forward this survey on to as many interested and relevant parties as possible. It will take approximately 10 minutes to complete the questionnaire. No question in the survey is mandatory; feel free to only answer the questions which are relevant to you and your applications. Your answers will help the MPI Forum guide its process to create a genuinely useful MPI-3 standard. This survey closes December 31, 2009. Your survey responses will be strictly confidential and data from this research will be reported only in the aggregate. Your information will be coded and will remain confidential. If you have questions at any time about the survey or the procedures, you may contact the MPI Forum via email to mpi-comments at mpi-forum.org. Thank you very much for your time and support. 
-- Jeff Squyres From rpnabar at gmail.com Mon Nov 23 16:31:47 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Mon, 23 Nov 2009 18:31:47 -0600 Subject: [Beowulf] UEFI (Unified Extensible Firmware Interface) for the BIOS In-Reply-To: References: Message-ID: On Tue, Nov 17, 2009 at 1:21 PM, Andrew Shewmaker wrote: > > Here's something on EFI I wrote up for myself in 2005. ?It's a bit out > of date, but it covers some stuff that wikipedia doesn't. ?In > particular, I would read the old Kernel Traffic to understand how > various developers dislike EFI. Thanks Andrew! Very useful. That explains a lot of stuff about EFI. -- Rahul From lindahl at pbm.com Mon Nov 23 16:35:34 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Mon, 23 Nov 2009 16:35:34 -0800 Subject: [Beowulf] CentOS plus Fedora kernel? Message-ID: <20091124003534.GA15927@bx9.net> For reasons complicated to explain, I want to run a Fedora kernel on CentOS 5. Does anyone have any words of wisdom or pointers to webpages for people who've done this? -- greg p.s. missed you guys at SC, I was stuck racking 500 servers... http://www.flickr.com/photos/skrenta/sets/72157622738924345/ From geoff at galitz.org Tue Nov 24 03:05:54 2009 From: geoff at galitz.org (Geoff Galitz) Date: Tue, 24 Nov 2009 12:05:54 +0100 Subject: [Beowulf] CentOS plus Fedora kernel? In-Reply-To: <20091124003534.GA15927@bx9.net> References: <20091124003534.GA15927@bx9.net> Message-ID: <486144B3B40B401D9D45AAE089D7D5BD@geoffPC> It would be safer to custom build a new Centos kernel. There are also "enhanced" kernels available for Centos in the centosplus software repository. They include support for technologies not normally found in Centos. Without knowing more about what you are looking for, I'd recommend checking out centosplus before exploring kernels from alternate distributions. That is the only wisdom I have... as minor as it is. -geoff --------------------------------- Geoff Galitz Blankenheim NRW, Germany http://www.galitz.org/ http://german-way.com/blog/ > -----Original Message----- > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On > Behalf Of Greg Lindahl > Sent: Dienstag, 24. November 2009 01:36 > To: beowulf at beowulf.org > Subject: [Beowulf] CentOS plus Fedora kernel? > > For reasons complicated to explain, I want to run a Fedora kernel on > CentOS 5. Does anyone have any words of wisdom or pointers to webpages > for people who've done this? > > -- greg > > p.s. missed you guys at SC, I was stuck racking 500 servers... > http://www.flickr.com/photos/skrenta/sets/72157622738924345/ > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From eagles051387 at gmail.com Tue Nov 24 03:54:45 2009 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Tue, 24 Nov 2009 12:54:45 +0100 Subject: [Beowulf] CentOS plus Fedora kernel? In-Reply-To: <486144B3B40B401D9D45AAE089D7D5BD@geoffPC> References: <20091124003534.GA15927@bx9.net> <486144B3B40B401D9D45AAE089D7D5BD@geoffPC> Message-ID: you also have to ask yourself what does the fedora kernel have that centos doesnt and that you cant add with a recompilation of the kernel or as mentioned in the previous email from the centosplus repo. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From dnlombar at ichips.intel.com Tue Nov 24 08:34:11 2009 From: dnlombar at ichips.intel.com (David N. Lombard) Date: Tue, 24 Nov 2009 08:34:11 -0800 Subject: [Beowulf] UEFI (Unified Extensible Firmware Interface) for the BIOS In-Reply-To: References: Message-ID: <20091124163411.GA13390@nlxdcldnl2.cl.intel.com> On Thu, Nov 12, 2009 at 04:47:42PM -0800, Rahul Nabar wrote: > Has anyone tried out UEFI (Unified Extensible Firmware Interface) in > the BIOS? The new servers I am buying come with this option in the > BIOS. Out of curiosity I googled it up. > > I am not sure if there were any HPC implications of this and wanted to > double check before I switched to this from my conventional > plain-vanilla BIOS. Any sort of "industry standard" always sounds good > but I thought it safer to check on the group first.... Just catching up w/ email. SC derails most normal work... I use UEFI, but, well, that shouldn't be a large surprise. elilo is the boot loader, it's written in C, and quite easy to manage. I actually use an abridged version of elilo to load a Linux kernel/initrd that provides my booting support. You can, for example, put the kernel/initrd normally PXE booted directly on the node and get into that w/o having to deal with PXE, TFTP, et al. Once you get that far, you can then carefully tune the kernel/initrd to exquisitely control the boot process. -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. From amacater at galactic.demon.co.uk Tue Nov 24 13:40:11 2009 From: amacater at galactic.demon.co.uk (Andrew M.A. Cater) Date: Tue, 24 Nov 2009 21:40:11 +0000 Subject: [Beowulf] CentOS plus Fedora kernel? In-Reply-To: <20091124003534.GA15927@bx9.net> References: <20091124003534.GA15927@bx9.net> Message-ID: <20091124214011.GB31909@galactic.demon.co.uk> On Mon, Nov 23, 2009 at 04:35:34PM -0800, Greg Lindahl wrote: > For reasons complicated to explain, I want to run a Fedora kernel on > CentOS 5. Does anyone have any words of wisdom or pointers to webpages > for people who've done this? > > -- greg > > p.s. missed you guys at SC, I was stuck racking 500 servers... > http://www.flickr.com/photos/skrenta/sets/72157622738924345/ > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf I hesitate to say this, because I'm talking to someone whose reputation is stellar and because my own biases may show slightly :) Don't - under any circumstances whatever - use Fedora on a production system or a system on which you want to do real work. If you _MUST_ do it - because, for example, your hardware is too new and not yet supported under the Red Hat Enterprise Linux / Centos 5.4 kernel - 2.6.18-164* if I recall correctly - then it _may_ work but it WILL cause you some degree of instability, interesting debugging interaction problems and some hours/days of frustration. Red Hat 5 was based on FC6 or Fedora 8 IIRC. Both now unavailable on the main mirrors. Fedora 9 has 2.6.25 which is quite a jump. It might be worth getting the source RPMs and building the kernels from each of 10,11,12 but on a CentOS machine. RPMForge doesn't seem to have much to help, here :( All best, AndyC From jack at crepinc.com Mon Nov 23 17:17:43 2009 From: jack at crepinc.com (Jack Carrozzo) Date: Mon, 23 Nov 2009 20:17:43 -0500 Subject: [Beowulf] CentOS plus Fedora kernel? 
In-Reply-To: <20091124003534.GA15927@bx9.net> References: <20091124003534.GA15927@bx9.net> Message-ID: <2ad0f9f60911231717w20c2277bp7f47827b45ef19c0@mail.gmail.com> Technically, there's no reason it wouldn't work, as it's the same to the OS as if you were just running a vanilla kernel (you're SURE you need the fedora kernel and a vanilla won't do?) Get the kernel source for the Fedora one, drop on CentOS, and go to town as you would a vanilla kernel. Cheers, -Jack Carrozzo On Mon, Nov 23, 2009 at 7:35 PM, Greg Lindahl wrote: > For reasons complicated to explain, I want to run a Fedora kernel on > CentOS 5. Does anyone have any words of wisdom or pointers to webpages > for people who've done this? > > -- greg > > p.s. missed you guys at SC, I was stuck racking 500 servers... > http://www.flickr.com/photos/skrenta/sets/72157622738924345/ > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From ralph.mason at gmail.com Tue Nov 24 09:30:24 2009 From: ralph.mason at gmail.com (Ralph Mason) Date: Tue, 24 Nov 2009 09:30:24 -0800 Subject: [Beowulf] Low cost Hi Density - Nehalem clusters Message-ID: <772f86d60911240930p22a7b2ddga6bb9f2238a3532a@mail.gmail.com> We have been building a development cluster using consumer core i7 processors with 12gb of ram each and a ide cf boot disk on a motherboard with embedded everything. This is very cost effective as it uses consumer processors and the cheapest ram (ram prices skyrocket once you go higher than 2gb modules). Does anyone know of any commercial offering that can pack these nodes into a high density rack or offer a similar price performance curve for the given ram and processing power? Thanks Ralph -------------- next part -------------- An HTML attachment was scrubbed... URL: From ed at eh3.com Tue Nov 24 18:46:33 2009 From: ed at eh3.com (Ed Hill) Date: Tue, 24 Nov 2009 21:46:33 -0500 Subject: [Beowulf] CentOS plus Fedora kernel? In-Reply-To: <20091124214011.GB31909@galactic.demon.co.uk> References: <20091124003534.GA15927@bx9.net> <20091124214011.GB31909@galactic.demon.co.uk> Message-ID: <20091124214633.6b20112c@localhost.localdomain> On Tue, 24 Nov 2009 21:40:11 +0000 "Andrew M.A. Cater" wrote: > > I hesitate to say this, because I'm talking to someone whose > reputation is stellar and because my own biases may show slightly :) > > Don't - under any circumstances whatever - use Fedora on a > production system or a system on which you want to do real work. > > If you _MUST_ do it - because, for example, your hardware is too new > and not yet supported under the Red Hat Enterprise Linux / Centos 5.4 > kernel - 2.6.18-164* if I recall correctly - then it _may_ work but > it WILL cause you some degree of instability, interesting debugging > interaction problems and some hours/days of frustration. Please take a deep breath and lay off the FUD. Fedora is a perfectly capable cluster OS. I know of a half-dozen operational clusters (some with literally hundreds of CPUs) that rely on it for software development, data analysis, production runs, etc., etc. In my experience, a well-managed Fedora-based cluster will have the same stability and usability as any other well-managed Linux cluster. 
And yes, Fedora does ship with some {leading,bleeding} edge bits which cuts *both* ways -- sometimes its a big help (e.g., kernel support for just-released hardware) and sometimes (e.g.; newer library versions) it can be a bit of a hassle. As others have mentioned, the short Fedora lifetimes are not for everyone. Unless you stumble upon a batch of truly unreliable hardware (which does occasionally happen), the overall utility of a Linux cluster is a *direct* result of the skill and care of those who manage it. Ed -- Edward H. Hill III, PhD | ed at eh3.com | http://eh3.com/ From jlforrest at berkeley.edu Tue Nov 24 19:41:56 2009 From: jlforrest at berkeley.edu (Jon Forrest) Date: Tue, 24 Nov 2009 19:41:56 -0800 Subject: [Beowulf] Low cost Hi Density - Nehalem clusters In-Reply-To: <772f86d60911240930p22a7b2ddga6bb9f2238a3532a@mail.gmail.com> References: <772f86d60911240930p22a7b2ddga6bb9f2238a3532a@mail.gmail.com> Message-ID: <4B0CA784.2060402@berkeley.edu> Ralph Mason wrote: > We have been building a development cluster using consumer core i7 > processors with 12gb of ram each and a ide cf boot disk on a motherboard > with embedded everything. This is very cost effective as it uses > consumer processors and the cheapest ram (ram prices skyrocket once you > go higher than 2gb modules). > > Does anyone know of any commercial offering that can pack these nodes > into a high density rack or offer a similar price performance curve for > the given ram and processing power? I've recently used Finetec (www.finetec.com) to put together a cluster based on the dual-motherboard Supermicro cases. Since each motherboard can hold 2 processors, and each AMD Istanbul processor has 6-cores, I can get 24 cores per rack unit. That's pretty dense. I believe that SuperMicro also makes similar motherboards for Intel processors. Check it out! Cordially, -- Jon Forrest Research Computing Support College of Chemistry 173 Tan Hall University of California Berkeley Berkeley, CA 94720-1460 510-643-1032 jlforrest at berkeley.edu From amjad11 at gmail.com Tue Nov 24 20:42:24 2009 From: amjad11 at gmail.com (amjad ali) Date: Tue, 24 Nov 2009 23:42:24 -0500 Subject: [Beowulf] Low cost Hi Density - Nehalem clusters In-Reply-To: <4B0CA784.2060402@berkeley.edu> References: <772f86d60911240930p22a7b2ddga6bb9f2238a3532a@mail.gmail.com> <4B0CA784.2060402@berkeley.edu> Message-ID: <428810f20911242042w151ab679y82258f85ad15ba2c@mail.gmail.com> Hi, while looking for high density (fat) nodes; one should keep in mind that more number of cores striving simultaneously to access memory, cause memory contention. Infact today CPU cores are fast and scalable but not the memory bandwidth. Experts say that buying a CPU is infact buying "bandwidth" not the "speed". Intel's specialty is QPI (quick path interconnect) --- removed Front Side Bus (FSB), in Nehalem CPUs, as an effort to over come the bottleneck of memory bandwidth. This bottleneck arises when more number of cpu cores, each performing floating point oprations in HPC applications, strive to get access to main memory simultaneously. Also Intel introduced QPI to compete with the AMD's speciality Direct Connect Architecture (DCA) in its latest CPUs. Nehalem Xeon 55xx are "server" class while Nehalem Core i7/i9 are "desktop" class cpus. In server processors Today in Servers latest (Nehalem) Quad core Xeon 55xx (DP/ for dual socket boards) and Xeon 35xx ( UP/ for single socket boards) are alive. Today in Servers Quad core Xeon 54xx (DP/ for dual socket boards) are not so good. 
And Xeon 53xx (DP/ for dual socket boards) are virtually dead. And Xeon 33xx/32xx ( UP/ for single socket boards) are not so good. Today in Servers latest Quad core and Six Core Opteron 83xx and 84xx (Shanghai -- 3rd generation) are alive. Today in Servers Opterons 13xx/23xx/24xx (Budapest/Barcelona --- 2nd generation) are not so good. There price differences reflect these 'facts'. On Tue, Nov 24, 2009 at 10:41 PM, Jon Forrest wrote: > Ralph Mason wrote: > >> We have been building a development cluster using consumer core i7 >> processors with 12gb of ram each and a ide cf boot disk on a motherboard >> with embedded everything. This is very cost effective as it uses consumer >> processors and the cheapest ram (ram prices skyrocket once you go higher >> than 2gb modules). >> Does anyone know of any commercial offering that can pack these nodes into >> a high density rack or offer a similar price performance curve for the given >> ram and processing power? >> > > I've recently used Finetec (www.finetec.com) to put together > a cluster based on the dual-motherboard Supermicro cases. > Since each motherboard can hold 2 processors, and each AMD > Istanbul processor has 6-cores, I can get 24 cores per rack > unit. That's pretty dense. > > I believe that SuperMicro also makes similar motherboards > for Intel processors. > > Check it out! > > Cordially, > -- > Jon Forrest > Research Computing Support > College of Chemistry > 173 Tan Hall > University of California Berkeley > Berkeley, CA > 94720-1460 > 510-643-1032 > jlforrest at berkeley.edu > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bcostescu at gmail.com Wed Nov 25 03:37:01 2009 From: bcostescu at gmail.com (Bogdan Costescu) Date: Wed, 25 Nov 2009 12:37:01 +0100 Subject: [Beowulf] CentOS plus Fedora kernel? In-Reply-To: <20091124003534.GA15927@bx9.net> References: <20091124003534.GA15927@bx9.net> Message-ID: On Tue, Nov 24, 2009 at 1:35 AM, Greg Lindahl wrote: > For reasons complicated to explain, I want to run a Fedora kernel on > CentOS 5. Does anyone have any words of wisdom or pointers to webpages > for people who've done this? A much newer kernel from Fedora usually requires newer utils as well and this is where it gets hairy. A recent Fedora kernel SRPM probably would not even compile on CentOS 5 (haven't tried lately) and a binary kernel RPM downloaded from a Fedora mirror will certainly not install (but I guess that you've tried that already ;-)). You can try forcing the installation (rpm --nodeps) and watch what breaks - chances are that at least basic functionality will remain; you can also try installing all the dependencies, but at that point you are probably running Fedora with only high-level user apps from CentOS; or something in between where you install only the Fedora kernel and the required few dependencies for the features you are interested in and hope that the rest will remain in some functional form. What do you need from the Fedora kernel ? And why not running the newly released Fedora 12 which will be the base for CentOS 6 anyway shortly ? 
(for a vague definition of shortly ;-)) Bogdan From pal at di.fct.unl.pt Wed Nov 25 04:24:00 2009 From: pal at di.fct.unl.pt (Paulo Afonso Lopes) Date: Wed, 25 Nov 2009 12:24:00 -0000 (WET) Subject: [Beowulf] CentOS plus Fedora kernel? In-Reply-To: References: <20091124003534.GA15927@bx9.net> Message-ID: <26580.193.136.122.19.1259151840.squirrel@webmail.fct.unl.pt> > On Tue, Nov 24, 2009 at 1:35 AM, Greg Lindahl wrote: >> For reasons complicated to explain, I want to run a Fedora kernel on >> CentOS 5. Does anyone have any words of wisdom or pointers to webpages >> for people who've done this? > Just guessing: if "reasons complicated to explain" refers to hardware which is not supported by CentOS, installing Fedora and booting CentOS in a VM guest would be a solution for you? paulo -- Paulo Afonso Lopes | Tel: +351- 21 294 8536 Departamento de Inform?tica | 294 8300 ext.10702 Faculdade de Ci?ncias e Tecnologia | Fax: +351- 21 294 8541 Universidade Nova de Lisboa | e-mail: poral at fct.unl.pt 2829-516 Caparica, PORTUGAL From h.nakashima at media.kyoto-u.ac.jp Wed Nov 25 00:03:53 2009 From: h.nakashima at media.kyoto-u.ac.jp (Hiroshi Nakashima) Date: Wed, 25 Nov 2009 17:03:53 +0900 Subject: [Beowulf] [hpc-announce] CFP of ICS'10: Intl. Conf. Supercomputing Message-ID: <163C0DE85F86479B8EE11B6B8945321B@MEDUZA> [Our apologies if you receive multiple copies of this CFP] CALL FOR PAPERS 24th International Conference on Supercomputing (ICS'10) http://www.ics-conference.org June 1-4, 2010 Epochal Tsukuba (Tsukuba International Congress Center) Tsukuba, Japan http://www.epochal.or.jp/eng/ Sponsored by ACM/SIGARCH ICS is the premier international forum for the presentation of research results in high-performance computing systems. In 2010 the conference will be held at the Epochal Tsukuba (Tsukuba International Congress Center) in Tsukuba City, the largest high-tech and academic city in Japan. Papers are solicited on all aspects of research, development, and application of high-performance experimental and commercial systems. Special emphasis will be given to work that leads to better understanding of the implications of the new era of million-scale parallelism and Exa-scale performance; including (but not limited to): * Computationally challenging scientific and commercial applications: studies and experiences to exploit ultra large scale parallelism, a large number of accelerators, and/or cloud computing paradigm. * High-performance computational and programming models: studies and proposals of new models, paradigms and languages for scalable application development, seamless exploitation of accelerators, and grid/cloud computing. * Architecture and hardware aspects: processor, accelerator, memory, interconnection network, storage and I/O architecture to make future systems scalable, reliable and power efficient. * Software aspects: compilers and runtime systems, programming and development tools, middleware and operating systems to enable us to scale applications and systems easily, efficiently and reliably. * Performance evaluation studies and theoretical underpinnings of any of the above topics, especially those giving us perspective toward future generation high-performance computing. * Large scale installations in the Petaflop era: design, scaling, power, and reliability, including case studies and experience reports, to show the baselines for future systems. 
In order to encourage open discussion on future directions, the program committee will provide higher priority for papers that present highly innovative and challenging ideas. Papers should not exceed 6,000 words, and should be submitted electronically, in PDF format using the ICS'10 submission web site. Submissions should be blind. The review process will include a rebuttal period. Please refer to the ICS'10 web site for detailed instructions. Workshop and tutorial proposals are also be solicited and due by January 18, 2010. For further information and future updates, refer to the ICS'10 web site at http://www.ics-conference.org or contact the General Chair (ics10-chair at hpcs.cs.tsukuba.ac.jp) or Program Co-Chairs (ics10-chairs at ac.upc.edu). Important Dates Abstract submission: January 11, 2010 Paper submission: January 18, 2010 Author notification: March 22, 2010 Final papers: April 15, 2010 For more information, please visit the conference web site at http://www.ics-conference.org [ICS 2010 Committee Members] GENRAL CHAIR Taisuke Boku, U. Tsukuba PROGRAM CO-CHAIRS Hiroshi Nakashima, Kyoto U. Avi Mendelson, Microsoft FINANCE CHAIR Kazuki Joe, Nara Women's U. PUBLICATION CHAIR Osamu Tatebe, U. Tsukuba PUBLICITY CO-CHAIRS Darren Kerbyson, LANL Hironori Nakajo, Tokyo U. Agric. & Tech. Serge Petiton, CNRS/LIFL WORKSHOP & TUTORIAL CHAIR Koji Inoue, Kyushu U. WEB & SUBMISSION CO-CHAIRS Eduard Ayguade, BSC/UPC Alex Ramirez, BSC/UPC LOCAL ARRANGEMENT CHAIR Daisuke Takahashi, U. Tsukuba PROGRAM COMITTEE Jung Ho Ahn, Seoul NU. Eduard Ayguade, BSC/UPC Carl Beckmann, Intel Muli Ben-Yehuda, IBM Gianfranco Bilardi, U. Padova Greg Byrd, NCSU Franck Cappello, INRIA Marcelo Cintra, U. Edinburgh Luiz De Rose, Cray Bronis De Supinski, LLNL/CASC Jack Dongarra, UTenn/ORNL Eytan Frachtenberg, Powerset Research Kyle Gallivan, FSU Stratis Gallopoulos, ,U. Patras Milind Girkar, Intel Bill Gropp, UIUC Mike Heroux, SNL Adolfy Hoisie, LANL Koh Hotta, Fujitsu Yutaka Ishikawa, U. Tokyo Takeshi Iwashita, Kyoto U. Kazuki Joe, Nara Woman's U. Hironori Kasahara, U. Waseda Arun Kejariwal, Yahoo Darren Kerbyson, LANL Moe Khaleel, PNNL Bill Kramer, NCSA Andrew Lewis, Griffith U. Jose Moreira, IBM Walid Najjar, U.C. Riverside Kengo Nakajima, U. Tokyo Hironori Nakajo, Tokyo U. Agric. & Tech. Hiroshi Nakamura, U. Tokyo Toshio Nakatani, IBM Research Tokyo Michael O'Boyle, U. Edinburgh Lenny Oliker, LBNL Theodore Papatheodoro, U. Patras Miquel Pericas, BSC Keshav Pingali, U. Texas Depei Qian, Beihang U. Alex Ramirez, BSC/UPC Valentina Salapura, IBM Mitsuhisa Sato, U. Tsukuba John Shalf, LBNL Takeshi Shimizu, Fujitsu Joshua Simons, Sun Microsystems Shinji Sumimoto, Fujitsu Makoto Taiji, Riken Toshikazu Takada, Riken Daisuke Takahashi, U. Tsukuba Guangming Tan, ICT Osamu Tatebe, U. Tsukuba Kenjiro Taura, U. Tokyo Rajeev Thakur, ANL Rong Tian, NCIC Robert Van Engelen, FSU Harry Wijshoff, Leiden Mitsuo Yokokawa, Riken Ayal Zaks, IBM Yunquan Zhang, ISCAS --------------------------------------------------------------------- Hiroshi Nakashima (h.nakashima at media.kyoto-u.ac.jp) Professor Academic Center for Computing and Media Studies Kyoto University ACCMS North Bldg., Yoshida Hon-machi, Sakyo-ku, Kyoto, 606-8501, JAPAN +81-75-753-7457/7448(F) From agshew at gmail.com Wed Nov 25 10:35:02 2009 From: agshew at gmail.com (Andrew Shewmaker) Date: Wed, 25 Nov 2009 11:35:02 -0700 Subject: [Beowulf] CentOS plus Fedora kernel? 
In-Reply-To: <20091124003534.GA15927@bx9.net> References: <20091124003534.GA15927@bx9.net> Message-ID: On Mon, Nov 23, 2009 at 5:35 PM, Greg Lindahl wrote: > For reasons complicated to explain, I want to run a Fedora kernel on > CentOS 5. Does anyone have any words of wisdom or pointers to webpages > for people who've done this? I'm interested in this sort of thing too. I haven't taken the latest Fedora 12 kernel and put it on a RHEL 5 distro yet, but I have backported the spec file to Fedora 9. Older distros don't checksum the rpms the same way, so the first step is to unpack it with rpm2cpio and then recreate the src rpm. I used vimdiff between the Fedora 12 spec file and a previous Fedora 9 kernel's spec file, and changed things like the list of directories included in the header subpackage. I also remember updating grubby and mkinitrd packages, but I don't recall if that ended up being totally necessary. If I get this done for a RHEL5 distro, then I'll let you know. Likewise, if you get it done, then I'd like to try it. -- Andrew Shewmaker From lindahl at pbm.com Wed Nov 25 11:34:27 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Wed, 25 Nov 2009 11:34:27 -0800 Subject: [Beowulf] CentOS plus Fedora kernel? In-Reply-To: References: <20091124003534.GA15927@bx9.net> Message-ID: <20091125193427.GE16652@bx9.net> On Wed, Nov 25, 2009 at 12:37:01PM +0100, Bogdan Costescu wrote: > A much newer kernel from Fedora usually requires newer utils as well Yeah, last time I did this I didn't rpm-ize the kernel, and that saved me quite a bit of work. I snagged the .config file out of Fedora, but didn't grab any patches. Since you guys seem determined to speculate about what feature I need, it's the new scheduler, CFS, which appeared in 2.6.23. > And why not running the newly released Fedora 12 which will be the > base for CentOS 6 anyway shortly ? (for a vague definition of > shortly ;-)) Fedora on a big production cluster? Not a chance. -- greg From agshew at gmail.com Wed Nov 25 12:14:01 2009 From: agshew at gmail.com (Andrew Shewmaker) Date: Wed, 25 Nov 2009 13:14:01 -0700 Subject: [Beowulf] CentOS plus Fedora kernel? In-Reply-To: <20091125193427.GE16652@bx9.net> References: <20091124003534.GA15927@bx9.net> <20091125193427.GE16652@bx9.net> Message-ID: On Wed, Nov 25, 2009 at 12:34 PM, Greg Lindahl wrote: > On Wed, Nov 25, 2009 at 12:37:01PM +0100, Bogdan Costescu wrote: > >> A much newer kernel from Fedora usually requires newer utils as well > > Yeah, last time I did this I didn't rpm-ize the kernel, and that saved > me quite a bit of work. I snagged the .config file out of Fedora, but > didn't grab any patches. > > Since you guys seem determined to speculate about what feature I need, > it's the new scheduler, CFS, which appeared in 2.6.23. In case you were considering the 2.6.25 kernel that shipped with Fedora 9, I recommend against it. I know there have been studies showing that it is a nice kernel with regard to low interrupt noise, but I have regularly seen it lock up while running MPI apps. 2.6.27 hasn't had the same issue, but we have seen 10-20% regression in IP network performance. -- Andrew Shewmaker From Michael.Frese at NumerEx-LLC.com Wed Nov 25 16:31:41 2009 From: Michael.Frese at NumerEx-LLC.com (Michael H. Frese) Date: Wed, 25 Nov 2009 17:31:41 -0700 Subject: [Beowulf] CentOS plus Fedora kernel? 
In-Reply-To: References: <20091124003534.GA15927@bx9.net> <20091125193427.GE16652@bx9.net> Message-ID: <6.2.5.6.2.20091125171548.060750b8@NumerEx-LLC.com> At 01:14 PM 11/25/2009, Andrew Shewmaker wrote: >On Wed, Nov 25, 2009 at 12:34 PM, Greg Lindahl wrote: > > On Wed, Nov 25, 2009 at 12:37:01PM +0100, Bogdan Costescu wrote: > > > >> A much newer kernel from Fedora usually requires newer utils as well > > > > Yeah, last time I did this I didn't rpm-ize the kernel, and that saved > > me quite a bit of work. I snagged the .config file out of Fedora, but > > didn't grab any patches. > > > > Since you guys seem determined to speculate about what feature I need, > > it's the new scheduler, CFS, which appeared in 2.6.23. > >In case you were considering the 2.6.25 kernel that shipped with Fedora 9, >I recommend against it. I know there have been studies showing that it is >a nice kernel with regard to low interrupt noise, but I have regularly seen it >lock up while running MPI apps. 2.6.27 hasn't had the same issue, but we >have seen 10-20% regression in IP network performance. I think I'm seeing 2.6.23 lockup with a big MPI app, but I wouldn't have guessed the connection without Andrew's message. We've been moving away from Fedora toward CentOS 5.4, and thus, back to 2.6.18, but apparently not fast enough. Mike From shaeffer at neuralscape.com Wed Nov 25 20:01:22 2009 From: shaeffer at neuralscape.com (Karen Shaeffer) Date: Wed, 25 Nov 2009 20:01:22 -0800 Subject: [Beowulf] CentOS plus Fedora kernel? In-Reply-To: References: <20091124003534.GA15927@bx9.net> <20091125193427.GE16652@bx9.net> Message-ID: <20091126040122.GA689@synapse.neuralscape.com> On Wed, Nov 25, 2009 at 01:14:01PM -0700, Andrew Shewmaker wrote: > On Wed, Nov 25, 2009 at 12:34 PM, Greg Lindahl wrote: > > On Wed, Nov 25, 2009 at 12:37:01PM +0100, Bogdan Costescu wrote: > > > In case you were considering the 2.6.25 kernel that shipped with Fedora 9, > I recommend against it. I know there have been studies showing that it is > a nice kernel with regard to low interrupt noise, but I have regularly seen it > lock up while running MPI apps. 2.6.27 hasn't had the same issue, but we > have seen 10-20% regression in IP network performance. Hi, The 2.6.27 kernel had a lot of new networking code in there. And some of the network performance issues carry forward into the 2.6.28 kernel. I would suggest you go to the 2.6.29 kernel. BTW, centos 5 runs a modified ext3 filesystem. So, that is an issue you'll need to come to terms with in moving to other kernels. FYI, fedora core 12 runs the 2.6.31 kernel. Good luck with it. Karen -- Karen Shaeffer Neuralscape, Palo Alto, Ca. 94306 shaeffer at neuralscape.com http://www.neuralscape.com From csamuel at vpac.org Wed Nov 25 20:20:14 2009 From: csamuel at vpac.org (Chris Samuel) Date: Thu, 26 Nov 2009 15:20:14 +1100 (EST) Subject: [Beowulf] CentOS plus Fedora kernel? In-Reply-To: <20091126040122.GA689@synapse.neuralscape.com> Message-ID: <14565064.571259209210844.JavaMail.csamuel@sys26> ----- "Karen Shaeffer" wrote: Hiya, > BTW, centos 5 runs a modified ext3 filesystem. So, > that is an issue you'll need to come to terms with > in moving to other kernels. We've not seen any issues running mainline kernels (2.6.30.x at present) with CentOS 5, what issues have you seen with this ? > FYI, fedora core 12 runs the 2.6.31 kernel. 2.6.31 also introduces new (better?) 
support for MCE's on the k10h family of CPUs (Barcelona, Shanghai), though if you (like us) have scripts that parse /var/log/mcelog you'll need to look at the data in /sys instead. cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From d.love at liverpool.ac.uk Thu Nov 26 09:02:05 2009 From: d.love at liverpool.ac.uk (Dave Love) Date: Thu, 26 Nov 2009 17:02:05 +0000 Subject: [Beowulf] Re: CentOS plus Fedora kernel? References: <20091124003534.GA15927@bx9.net> <20091125193427.GE16652@bx9.net> Message-ID: <87zl69xlrm.fsf@liv.ac.uk> Greg Lindahl writes: > Since you guys seem determined to speculate about what feature I need, > it's the new scheduler, CFS, which appeared in 2.6.23. What's wrong with the Linux from RedHat's HPC/`Grid' offering (whatever it's called), then? It has a 2.6.24 base. I can't try it because of a pestilential proprietary driver, but it seemed really to be more appropriate for compute nodes in various respects (given that I don't have a choice about running a RHEL-based system). From shaeffer at neuralscape.com Thu Nov 26 10:32:38 2009 From: shaeffer at neuralscape.com (Karen Shaeffer) Date: Thu, 26 Nov 2009 10:32:38 -0800 Subject: [Beowulf] CentOS plus Fedora kernel? In-Reply-To: <14565064.571259209210844.JavaMail.csamuel@sys26> References: <20091126040122.GA689@synapse.neuralscape.com> <14565064.571259209210844.JavaMail.csamuel@sys26> Message-ID: <20091126183238.GA9765@synapse.neuralscape.com> On Thu, Nov 26, 2009 at 03:20:14PM +1100, Chris Samuel wrote: > > ----- "Karen Shaeffer" wrote: > > Hiya, > > > BTW, centos 5 runs a modified ext3 filesystem. So, > > that is an issue you'll need to come to terms with > > in moving to other kernels. > > We've not seen any issues running mainline kernels > (2.6.30.x at present) with CentOS 5, what issues have > you seen with this ? Hi Chris, Actually, I haven't tried it recently. But I have recently tried to run a recompiled and reconfigured RHEL 5 kernel on a stock ext3 filesystem. And it crashed almost immediately with filesystem corruption. And the RHEL 5 grub can't read a stock ext3 filesystem either. I appreciate your feedback, because I was wondering about that very question. And my comment only meant to suggest one needed to be aware of the issue. Based on your comments, it appears the RH extension of ext3 is backward compatible with kernel.org kernels. But apparently the RH kernels require the extensions in order to write. Thanks, Karen -- Karen Shaeffer Neuralscape, Palo Alto, Ca. 94306 shaeffer at neuralscape.com http://www.neuralscape.com From ispmarin at gmail.com Thu Nov 26 16:38:38 2009 From: ispmarin at gmail.com (Ivan Marin) Date: Thu, 26 Nov 2009 22:38:38 -0200 Subject: [Beowulf] Parallel programming using Scalapack and OpenMPI Message-ID: <751c63ee0911261638v6bf2e136y26df1c734413615f@mail.gmail.com> Hello all, I've been following the discussions here in this list for quite a while and always enjoying the discussions, and did some admin work in beowulf clusters. But after a long time far from parallel programming, now for my PhD in groundwater simulation I'm trying again to implement the linear solver pdgesv from Scalapack. I'm having some troubles with the definitions in the function call within C++ and the best data distribution, so I would like to ask: is there anybody on this list developing with Scalapack and C++? 
Where is the proper place to ask Scalapack questions? It seems that both the forum and the mailing list doesn't have any activity recently. Thank you in advance! Ivan Marin Laboratório de Hidráulica Computacional - LHC Departamento de Hidráulica e Saneamento - SHS Escola de Engenharia de São Carlos - EESC Universidade de São Paulo - USP http://albatroz.shs.eesc.usp.br +55 16 3373 8270 -------------- next part -------------- An HTML attachment was scrubbed... URL: From greg.matthews at diamond.ac.uk Fri Nov 27 02:08:42 2009 From: greg.matthews at diamond.ac.uk (Gregory Matthews) Date: Fri, 27 Nov 2009 10:08:42 +0000 Subject: [Beowulf] CentOS plus Fedora kernel? In-Reply-To: <20091126183238.GA9765@synapse.neuralscape.com> References: <20091126040122.GA689@synapse.neuralscape.com> <14565064.571259209210844.JavaMail.csamuel@sys26> <20091126183238.GA9765@synapse.neuralscape.com> Message-ID: <4B0FA52A.3050901@diamond.ac.uk> ack... meant this to go to the list: Karen Shaeffer wrote: > On Thu, Nov 26, 2009 at 03:20:14PM +1100, Chris Samuel wrote: >> ----- "Karen Shaeffer" wrote: >> >> Hiya, >> >>> BTW, centos 5 runs a modified ext3 filesystem. So, >>> that is an issue you'll need to come to terms with >>> in moving to other kernels. >> We've not seen any issues running mainline kernels >> (2.6.30.x at present) with CentOS 5, what issues have >> you seen with this ? > > Hi Chris, > Actually, I haven't tried it recently. But I have recently tried > to run a recompiled and reconfigured RHEL 5 kernel on a stock > ext3 filesystem. And it crashed almost immediately with filesystem > corruption. And the RHEL 5 grub can't read a stock ext3 filesystem > either. this is news to me. where can I find info on the modifications that RH are using? Google hasn't turned anything up for me this morning. GREG > > I appreciate your feedback, because I was wondering about that > very question. And my comment only meant to suggest one needed to > be aware of the issue. Based on your comments, it appears the RH > extension of ext3 is backward compatible with kernel.org kernels. > But apparently the RH kernels require the extensions in order to > write. > > Thanks, > Karen > > -- Greg Matthews 01235 778658 Senior Computer Systems Administrator Diamond Light Source, Oxfordshire, UK From jellogum at gmail.com Fri Nov 27 21:21:42 2009 From: jellogum at gmail.com (Jeremy Baker) Date: Fri, 27 Nov 2009 21:21:42 -0800 Subject: [Beowulf] ask about mpich In-Reply-To: References: Message-ID: Edited revision of original post without permission from author: "Hello folks; I want to make a cluster system employing the command/function: mpich, in Ubuntu, but I am not too familiar with it. I could use some advice for a problem that a good translation might solve for me. I have followed the instructions [assuming translated into your native language] for mpich, the information related to the cluster that I am using, but it does not work. The project is related to an important work that is my thesis. Help would be greatly appreciated!"
> > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -- Jeremy Baker PO 297 Johnson, VT 05656 -------------- next part -------------- An HTML attachment was scrubbed... URL: From d.love at liverpool.ac.uk Sun Nov 29 15:01:02 2009 From: d.love at liverpool.ac.uk (Dave Love) Date: Sun, 29 Nov 2009 23:01:02 +0000 Subject: [Beowulf] Re: CentOS plus Fedora kernel? References: <20091126040122.GA689@synapse.neuralscape.com> <14565064.571259209210844.JavaMail.csamuel@sys26> <20091126183238.GA9765@synapse.neuralscape.com> <4B0FA52A.3050901@diamond.ac.uk> Message-ID: <87bpillyvl.fsf@liv.ac.uk> Gregory Matthews writes: >> Actually, I haven't tried it recently. But I have recently tried >> to run a recompiled and reconfigured RHEL 5 kernel on a stock >> ext3 filesystem. And it crashed almost immediately with filesystem >> corruption. And the RHEL 5 grub can't read a stock ext3 filesystem >> either. > > this is news to me. where can I find info on the modifications that RH > are using? Google hasn't turned anything up for me this morning. It sounds implausible, and the RH grub has no patches that mention ext3 in their name, but I'm not sure I've actually booted that way round. >> But apparently the RH kernels require the extensions in order to >> write. Experimentally an ext3 filesystem made on SuSE, seems fine for i/o from RH 5.4. From agshew at gmail.com Mon Nov 30 09:15:38 2009 From: agshew at gmail.com (Andrew Shewmaker) Date: Mon, 30 Nov 2009 10:15:38 -0700 Subject: [Beowulf] Re: CentOS plus Fedora kernel? In-Reply-To: <87zl69xlrm.fsf@liv.ac.uk> References: <20091124003534.GA15927@bx9.net> <20091125193427.GE16652@bx9.net> <87zl69xlrm.fsf@liv.ac.uk> Message-ID: On Thu, Nov 26, 2009 at 10:02 AM, Dave Love wrote: >> Since you guys seem determined to speculate about what feature I need, >> it's the new scheduler, CFS, which appeared in 2.6.23. > > What's wrong with the Linux from RedHat's HPC/`Grid' offering (whatever > it's called), then? ?It has a 2.6.24 base. ?I can't try it because of a > pestilential proprietary driver, but it seemed really to be more > appropriate for compute nodes in various respects (given that I don't > have a choice about running a RHEL-based system). I had assumed that they were using the same 2.6.18 base, just with more patches. I might try that out, but part of the reason I want to use the latest kernel is that I want to do a better job of providing feedback to the kernel developers. -- Andrew Shewmaker From gus at ldeo.columbia.edu Mon Nov 30 10:24:09 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Mon, 30 Nov 2009 13:24:09 -0500 Subject: [Beowulf] Parallel programming using Scalapack and OpenMPI In-Reply-To: <751c63ee0911261638v6bf2e136y26df1c734413615f@mail.gmail.com> References: <751c63ee0911261638v6bf2e136y26df1c734413615f@mail.gmail.com> Message-ID: <4B140DC9.5020100@ldeo.columbia.edu> Hi Ivan PETSc (short for Portable Extensible Toolkit for Scientific Computation, from Argonne Natl. Lab.) gives you the ability to use Scalapack and provides a number of linear and PDE solvers: http://www.mcs.anl.gov/petsc/petsc-as/ It builds on top of MPI, BLAS and LAPACK, but you can bind it to a variety of Linear Algebra and other packages, including Scalapack. IIRR, PETSc's default MPI is MPICH2, which is how I built it here a while ago. 
However, I think it can be built with OpenMPI as well. See this FAQ: http://www.open-mpi.org/faq/?category=mpi-apps#petsc PETSc has C, Fortran, C++, and Python APIs. Some people here used PETSc very successfully (problems were solved, theses were written, PhDs were awarded, papers were published) on global ocean circulation inverse problems, magma migration modeling (i.e., reactive fluid flow in porous media, sounds familiar?), etc. PETSc is certainly good for prototyping, although there is a learning curve. As for efficiency and production codes, I don't know, but you can check their FAQ: http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html and mailing list, where you may also find some useful information about your questions and specific problem: https://lists.mcs.anl.gov/mailman/listinfo/petsc-users http://lists.mcs.anl.gov/pipermail/petsc-users/ In case you don't know, the (very simple) Scalapack home page is on Netlib site: http://www.netlib.org/scalapack/scalapack_home.html My two cents. Boa sorte! Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- Ivan Marin wrote: > Hello all, > > I've been following the discussions here in this list for quite a while > and always enjoying the discussions, and did some admin work in beowulf > clusters. But after a long time far from parallel programming, now for > my PhD in groundwater simulation I'm trying again to implement the > linear solver pdgesv from Scalapack. I'm having some troubles with the > definitions in the function call within C++ and the best data > distribution, so I would like to ask: is there anybody on this list > developing with Scalapack and C++? Where is the proper place to ask > Scalapack questions? It seems that both the forum and the mailing list > doesn't have any activity recently. > > Thank you in advance! > > Ivan Marin > > Laborat?rio de Hidr?ulica Computacional - LHC > Departamento de Hidr?ulica e Saneamento - SHS > Escola de Engenharia de S?o Carlos - EESC > Universidade de S?o Paulo - USP > > http://albatroz.shs.eesc.usp.br > +55 16 3373 8270 > > > ------------------------------------------------------------------------ > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From gus at ldeo.columbia.edu Mon Nov 30 12:15:42 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Mon, 30 Nov 2009 15:15:42 -0500 Subject: [Beowulf] Parallel programming using Scalapack and OpenMPI In-Reply-To: <428810f20911301147g20049d51x9a9606a27ec950ff@mail.gmail.com> References: <751c63ee0911261638v6bf2e136y26df1c734413615f@mail.gmail.com> <4B140DC9.5020100@ldeo.columbia.edu> <428810f20911301147g20049d51x9a9606a27ec950ff@mail.gmail.com> Message-ID: <4B1427EE.3010000@ldeo.columbia.edu> Hi Amjad amjad ali wrote: > Hi, > Please explain in detail about: > > PETSc is certainly good for prototyping, > although there is a learning curve. > > What is meant by learning curve. [?] http://en.wikipedia.org/wiki/Learning_curve Google is your friend! Wikipedia is your friend! 
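More seriously: to give a feel for what the PETSc route looks like from C, below is a bare-bones, from-memory outline of a parallel linear solve (Ax = b). The matrix entries are omitted and a few call signatures (KSPSetOperators, the *Destroy routines) have shifted between PETSc releases, so please treat it as a sketch of the overall flow rather than working code; the tutorial examples shipped with PETSc are the authoritative reference.

    /* Sketch only: error checking (ierr/CHKERRQ) omitted, matrix entries
       omitted, and signatures vary slightly between PETSc versions. */
    #include <petscksp.h>

    int main(int argc, char **argv)
    {
        Mat A;             /* distributed sparse matrix              */
        Vec x, b;          /* solution and right-hand side           */
        KSP ksp;           /* Krylov solver context                  */
        PetscInt n = 1000; /* global problem size, made up for demo  */

        PetscInitialize(&argc, &argv, NULL, NULL);

        MatCreate(PETSC_COMM_WORLD, &A);
        MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
        MatSetFromOptions(A);
        MatSetUp(A);              /* or the Mat*AIJSetPreallocation calls */
        /* ... MatSetValues(...) for the locally owned rows goes here ... */
        MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
        MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

        VecCreate(PETSC_COMM_WORLD, &b);
        VecSetSizes(b, PETSC_DECIDE, n);
        VecSetFromOptions(b);
        VecDuplicate(b, &x);
        VecSet(b, 1.0);

        KSPCreate(PETSC_COMM_WORLD, &ksp);
        KSPSetOperators(ksp, A, A);  /* older releases take an extra flag  */
        KSPSetFromOptions(ksp);      /* method chosen from the command line */
        KSPSolve(ksp, b, x);

        KSPDestroy(&ksp);            /* older releases: KSPDestroy(ksp)     */
        VecDestroy(&x); VecDestroy(&b);
        MatDestroy(&A);
        PetscFinalize();
        return 0;
    }

The nice part while prototyping is that the solver and preconditioner are picked at run time, e.g. mpiexec -np 4 ./solve -ksp_type gmres -pc_type bjacobi, without recompiling.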
IHIH Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- > > > > On Mon, Nov 30, 2009 at 1:24 PM, Gus Correa > wrote: > > Hi Ivan > > PETSc (short for > Portable Extensible Toolkit for Scientific Computation, > from Argonne Natl. Lab.) > gives you the ability to use Scalapack > and provides a number of linear and PDE solvers: > > http://www.mcs.anl.gov/petsc/petsc-as/ > > It builds on top of MPI, BLAS and LAPACK, > but you can bind it to a variety of Linear Algebra and > other packages, including Scalapack. > > IIRR, PETSc's default MPI is MPICH2, which is how > I built it here a while ago. > However, I think it can be built with OpenMPI as well. > See this FAQ: > http://www.open-mpi.org/faq/?category=mpi-apps#petsc > > PETSc has C, Fortran, C++, and Python APIs. > > Some people here used PETSc very successfully > (problems were solved, theses were written, > PhDs were awarded, papers were published) > on global ocean circulation inverse problems, > magma migration modeling (i.e., reactive fluid > flow in porous media, sounds familiar?), etc. > > PETSc is certainly good for prototyping, > although there is a learning curve. > As for efficiency and production codes, I don't know, > but you can check their FAQ: > > http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html > > and mailing list, > where you may also find some useful information about > your questions and specific problem: > > https://lists.mcs.anl.gov/mailman/listinfo/petsc-users > http://lists.mcs.anl.gov/pipermail/petsc-users/ > > In case you don't know, the (very simple) > Scalapack home page is on Netlib site: > > http://www.netlib.org/scalapack/scalapack_home.html > > > My two cents. > > Boa sorte! > Gus Correa > --------------------------------------------------------------------- > Gustavo Correa > Lamont-Doherty Earth Observatory - Columbia University > Palisades, NY, 10964-8000 - USA > --------------------------------------------------------------------- > > > Ivan Marin wrote: > > Hello all, > > I've been following the discussions here in this list for quite > a while and always enjoying the discussions, and did some admin > work in beowulf clusters. But after a long time far from > parallel programming, now for my PhD in groundwater simulation > I'm trying again to implement the linear solver pdgesv from > Scalapack. I'm having some troubles with the definitions in the > function call within C++ and the best data distribution, so I > would like to ask: is there anybody on this list developing with > Scalapack and C++? Where is the proper place to ask Scalapack > questions? It seems that both the forum and the mailing list > doesn't have any activity recently. > > Thank you in advance! 
> > Ivan Marin > > Laborat?rio de Hidr?ulica Computacional - LHC > Departamento de Hidr?ulica e Saneamento - SHS > Escola de Engenharia de S?o Carlos - EESC > Universidade de S?o Paulo - USP > > http://albatroz.shs.eesc.usp.br > +55 16 3373 8270 > > > ------------------------------------------------------------------------ > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > From amjad11 at gmail.com Mon Nov 30 12:24:34 2009 From: amjad11 at gmail.com (amjad ali) Date: Mon, 30 Nov 2009 15:24:34 -0500 Subject: [Beowulf] MPI Processes + Auto Vectorization Message-ID: <428810f20911301224u64783b27q465a1bc73918849@mail.gmail.com> Hi, Suppose we run a parallel MPI code with 64 processes on a cluster, say of 16 nodes. The cluster nodes has multicore CPU say 4 cores on each node. Now all the 64 cores on the cluster running a process. Program is SPMD, means all processes has the same workload. Now if we had done auto-vectorization while compiling the code (for example with Intel compilers); Will there be any benefit (efficiency/scalability improvement) of having code with the auto-vectorization? Or we will get the same performance as without Auto-vectorization in this example case? How can we really get benefit in performance improvement with Auto-Vectorization? Thank you. -------------- next part -------------- An HTML attachment was scrubbed... URL: From dnlombar at ichips.intel.com Mon Nov 30 14:50:24 2009 From: dnlombar at ichips.intel.com (David N. Lombard) Date: Mon, 30 Nov 2009 14:50:24 -0800 Subject: [Beowulf] MPI Processes + Auto Vectorization In-Reply-To: <428810f20911301224u64783b27q465a1bc73918849@mail.gmail.com> References: <428810f20911301224u64783b27q465a1bc73918849@mail.gmail.com> Message-ID: <20091130225024.GA28311@nlxdcldnl2.cl.intel.com> On Mon, Nov 30, 2009 at 01:24:34PM -0700, amjad ali wrote: > Hi, > Suppose we run a parallel MPI code with 64 processes on a cluster, say of 16 nodes. The cluster nodes has multicore CPU say 4 cores on each node. > > Now all the 64 cores on the cluster running a process. Program is SPMD, means all processes has the same workload. > > Now if we had done auto-vectorization while compiling the code (for example with Intel compilers); Will there be any benefit (efficiency/scalability improvement) of having code with the auto-vectorization? Or we will get the same performance as without Auto-vectorization in this example case? > > How can we really get benefit in performance improvement with Auto-Vectorization? Vectorization takes advantage of the processor's vector instructions to increase data-level parallelism. How much that benefits your code depends very much on your code; you would need to recompile your code and test. -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. 
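To make David's point concrete, here is a minimal sketch of the kind of loop an auto-vectorizer targets (the function, array names and flags below are illustrative only, not taken from the thread):

    #include <stddef.h>

    /* C99 restrict promises x and y do not alias, which helps the
       vectorizer prove the iterations are independent. */
    void scale_add(size_t n, double a,
                   const double *restrict x, double *restrict y)
    {
        /* With optimization enabled (e.g. icc -O3, or gcc -O3 which
           turns on -ftree-vectorize), the compiler can emit packed SSE
           instructions that handle two doubles per operation, using the
           SIMD unit inside whichever core the MPI rank is running on. */
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

No spare cores are needed: each of the 64 ranks gets this speedup inside its own core. Whether it shows up as wall-clock improvement depends on how much time the code spends in such loops and on whether memory bandwidth, rather than arithmetic, is the real bottleneck; the compilers' vectorization reports (icc's -vec-report, gcc's -ftree-vectorizer-verbose, exact option names vary by compiler version) show what was actually vectorized.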
From amjad11 at gmail.com Mon Nov 30 22:14:13 2009 From: amjad11 at gmail.com (amjad ali) Date: Tue, 1 Dec 2009 01:14:13 -0500 Subject: [Beowulf] MPI Processes + Auto Vectorization In-Reply-To: <20091130225024.GA28311@nlxdcldnl2.cl.intel.com> References: <428810f20911301224u64783b27q465a1bc73918849@mail.gmail.com> <20091130225024.GA28311@nlxdcldnl2.cl.intel.com> Message-ID: <428810f20911302214x4b85a07du4684bbb57f60a72b@mail.gmail.com> Hi, perhaps I did not ask my question clearly. My question is: if we do not have free cpu cores in a PC or cluster (all cores are running MPI processes), is auto-vectorization still beneficial? Or is it beneficial only if we have some free cpu cores locally? thanks On Mon, Nov 30, 2009 at 5:50 PM, David N. Lombard wrote: > On Mon, Nov 30, 2009 at 01:24:34PM -0700, amjad ali wrote: > > Hi, > > Suppose we run a parallel MPI code with 64 processes on a cluster, say of > 16 nodes. The cluster nodes has multicore CPU say 4 cores on each node. > > > > Now all the 64 cores on the cluster running a process. Program is SPMD, > means all processes has the same workload. > > > > Now if we had done auto-vectorization while compiling the code (for > example with Intel compilers); Will there be any benefit > (efficiency/scalability improvement) of having code with the > auto-vectorization? Or we will get the same performance as without > Auto-vectorization in this example case? > > > > How can we really get benefit in performance improvement with > Auto-Vectorization? > > Vectorization takes advantage of the processor's vector instructions to > increase data-level parallelism. > How much that benefits your code depends very much on your code; you would > need to recompile your code and test. > > -- > David N. Lombard, Intel, Irvine, CA > I do not speak for Intel Corporation; all comments are strictly my own. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From h-bugge at online.no Mon Nov 30 23:54:54 2009 From: h-bugge at online.no (Håkon Bugge) Date: Tue, 1 Dec 2009 08:54:54 +0100 Subject: [Beowulf] MPI Processes + Auto Vectorization In-Reply-To: <428810f20911302214x4b85a07du4684bbb57f60a72b@mail.gmail.com> References: <428810f20911301224u64783b27q465a1bc73918849@mail.gmail.com> <20091130225024.GA28311@nlxdcldnl2.cl.intel.com> <428810f20911302214x4b85a07du4684bbb57f60a72b@mail.gmail.com> Message-ID: On Dec 1, 2009, at 7:14 , amjad ali wrote: > My question is: if we do not have free cpu cores in a PC or > cluster (all cores are running MPI processes), is auto-vectorization > still beneficial? Or is it beneficial only if we have > some free cpu cores locally? Amjad, Vectorization is, in x86_64 parlance, a compilation technique where the compiler will utilize certain instructions which operate on short vectors. When you execute such a program on a particular core, these vector instructions will execute on a special execution unit _within_ the core you're executing on. Hence, no additional resources or cores are required to use vector instructions, and you will benefit from them independently of whether you fully use all cores in your cluster or not. Håkon -------------- next part -------------- An HTML attachment was scrubbed... URL: From rjtucke at gmail.com Wed Nov 25 13:53:49 2009 From: rjtucke at gmail.com (Ross Tucker) Date: Wed, 25 Nov 2009 14:53:49 -0700 Subject: [Beowulf] New member, upgrading our existing Beowulf cluster Message-ID: <2f30dc950911251353i4a5dfacay6378a655cf0feda7@mail.gmail.com> Greetings!
I'm a new member to this list, but the research group that I work for has had a working cluster for many years. I am now looking at upgrading our current configuration. I was wondering if anyone has actual experience with running more than one node from a single power supply. Even just two boards on one PSU would be nice. We will be using barely 200W per node for 50 nodes and it just seems like a big waste to buy 50 power supply units. I have read the old posts but did not see any reports of success. Best regards, Ross Tucker Ariz State Univ -------------- next part -------------- An HTML attachment was scrubbed... URL: From lm.moreira at gmail.com Thu Nov 26 10:15:58 2009 From: lm.moreira at gmail.com (Leonardo Machado Moreira) Date: Thu, 26 Nov 2009 16:15:58 -0200 Subject: [Beowulf] Cluster Users in Clusters Linux and Windows Message-ID: <4788ffe70911261015t2817fcd4i55044d692b1aed64@mail.gmail.com> Hi! I am trying to create a cluster with only two machines. The server will be a Linux machine, an Arch Linux distribution to be more specific. The slave machine will be a Windows 7 machine. I have found it is possible, but I was looking and have found that each machine on the cluster must have the same user for the cluster. I was wondering how would I deal with it with the windows machine ? Do I have do implement a specific program in it? Would it found the rsh ? Thanks in advance! Leonardo Machado Moreira -------------- next part -------------- An HTML attachment was scrubbed... URL: From toon.knapen at gmail.com Thu Nov 26 12:46:34 2009 From: toon.knapen at gmail.com (Toon Knapen) Date: Thu, 26 Nov 2009 21:46:34 +0100 Subject: [Beowulf] rhel hpc Message-ID: Dear all, I've been working on hpux-itanium for the last 2 years (and even unsubscribed to beowulf-ml during most of that time, my bad) but soon will turn back to a beowulf cluster (HP DL380G6's with Xeon X5560, amcc/3ware 9690SA-8i with 4 x 600GB Cheetah 15krpm). Now I have a few questions on the config. 1) our company is standardised on RHEL 5.1. Would sticking with rhel 5.1 instead of going to the latest make a difference. 2) What are the advantages of the hpc version of rhel. I browsed the doc but unless having to compile mpi myself I do not see a difference or did I miss soth. 3) which filesystem is advisable knowing that we're calculating on large berkeley db databases thanks in advance, toon -------------- next part -------------- An HTML attachment was scrubbed... URL: From dzaletnev at yandex.ru Fri Nov 27 23:27:04 2009 From: dzaletnev at yandex.ru (Dmitry Zaletnev) Date: Sat, 28 Nov 2009 10:27:04 +0300 Subject: [Beowulf] ask about mpich Message-ID: <141481259393224@webmail111.yandex.ru> MPICH2 condensed instructions: mpd daemon ring setup (MPICH2 installed supposed) 1. At all nodes: cd $HOME touch .mpd.conf chmod 600 .mpd.conf nano .mpd.conf (if there's no nano - aptitude install nano) Enter in the file: MPD_SECRETWORD=_secretword_ _secretword_ must be the same at all nodes 2. At all nodes in /etc/hosts as root enter all nodes IP's, for the current node 127.0.0.1 change by its actual IP. 3. At head node run: mpd & mpdtrace -l Get _host_ _ _port_ 4. At slave nodes run mpd -h _host_ -p _port_ & 5. See if the daemon ring started by running mpdtrace Running mpiexec (2 examples) 1. mpiexec -machinefile /home/user/mfile -np 4 -wdir /home/user ./_YourCode_ The content of the file mfile: Slave0:2 Slave1:2 2 - number of cores 2. 
mpiexec -genv FOO BAR -n 2 -host Slave0 a.out : -n 2 -host Slave1 b.out
FOO - environment variable
BAR - its value
a.out and b.out - executables.
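Before debugging a full application, it can help to launch a minimal MPI
program through the ring and machinefile described above. The following is
an illustrative sketch (the file name, executable name, and paths are
placeholders, not taken from this thread); it assumes the mpicc wrapper
that ships with MPICH2 is on the PATH.

/* mpi_hello.c - minimal sanity check for the mpd ring and machinefile
 * described above. Illustrative sketch only; names are placeholders.
 * Build:  mpicc -o mpi_hello mpi_hello.c
 * Run:    mpiexec -machinefile /home/user/mfile -np 4 ./mpi_hello
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    /* Each rank reports where it landed; this confirms that processes
     * really start on the slave nodes listed in the machinefile. */
    printf("rank %d of %d running on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}

If every rank prints a hostname from the machinefile, the ring and
machinefile are working, and any remaining failure lies in the application
itself.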
> Edited revision of original post without permission from author:
> "Hello folks;
> I want to make a cluster system employing the command/function: mpich,
> in Ubuntu, but I am not too familiar with it. I could use some advice for
> a problem that a good translation might solve for me.
> I have followed the instructions [assuming translated into your native
> language] for mpich, the information related to the cluster that I am
> using, but it does not work. The project is related to an important work
> that is my thesis. Help would be greatly appreciated!"
> What is your native language?
> On Thu, Nov 12, 2009 at 4:33 AM, christian suhendra wrote:
> > halo guys i wants to make a cluster system with mpich in ubuntu,,but i
> > have troubleshooting with mpich..
> > but when i run the example program in mpich..it doesn't work in
> > cluster..but i've registered the node on machine.LINUX..
> > but still not working
> > please help me..this is my thesis...
> >
> > _______________________________________________
> > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> > To change your subscription (digest mode or unsubscribe) visit
> > http://www.beowulf.org/mailman/listinfo/beowulf
> >
> --
> Jeremy Baker
> PO 297
> Johnson, VT
> 05656

From dmitri.chubarov at gmail.com Sat Nov 28 00:55:10 2009
From: dmitri.chubarov at gmail.com (Dmitri Chubarov)
Date: Sat, 28 Nov 2009 14:55:10 +0600
Subject: [Beowulf] ask about mpich
In-Reply-To: 
References: 
Message-ID: 

Hello, Christian,

you probably will need some advice off the list with your mpich setup. If
you tell the list where you are located and studying, you might receive
the help you need from someone who speaks your language or is located
nearby, since the audience of the Beowulf list is indeed very widely
dispersed around the globe.

Also you might want to send a more detailed description of your problems
to the mpich-discuss mailing list.

Jeremy, judging by the name of the original poster, the native language is
probably Indonesian.

Best regards,
  Dima

On Sat, Nov 28, 2009 at 11:21 AM, Jeremy Baker wrote:
> Edited revision of original post without permission from author:
>
> "Hello folks;
>
> I want to make a cluster system employing the command/function: mpich, in
> Ubuntu, but I am not too familiar with it. I could use some advice for a
> problem that a good translation might solve for me.
>
> I have followed the instructions [assuming translated into your native
> language] for mpich, the information related to the cluster that I am
> using, but it does not work. The project is related to an important work
> that is my thesis. Help would be greatly appreciated!"
>
> What is your native language?
>
> On Thu, Nov 12, 2009 at 4:33 AM, christian suhendra
> wrote:
>>
>> halo guys i wants to make a cluster system with mpich in ubuntu,,but i
>> have troubleshooting with mpich..
>> but when i run the example program in mpich..it doesn't work in
>> cluster..but i've registered the node on machine.LINUX..
>> but still not working
>> please help me..this is my thesis...
>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>
> --
> Jeremy Baker
> PO 297
> Johnson, VT
> 05656
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>

From toon.knapen at gmail.com Mon Nov 30 01:35:47 2009
From: toon.knapen at gmail.com (Toon Knapen)
Date: Mon, 30 Nov 2009 10:35:47 +0100
Subject: [Beowulf] Fwd: rhel hpc
In-Reply-To: 
References: 
Message-ID: 

Dear all,

I've been working on hpux-itanium for the last 2 years (and even
unsubscribed from the beowulf-ml during most of that time, my bad) but will
soon turn back to a beowulf cluster (HP DL380G6's with Xeon X5560,
amcc/3ware 9690SA-8i with 4 x 600GB Cheetah 15krpm). Now I have a few
questions on the config.

1) Our company is standardised on RHEL 5.1. Would sticking with rhel 5.1
instead of going to the latest release make a difference?

2) What are the advantages of the hpc version of rhel? I browsed the doc
but unless having to compile mpi myself I do not see a difference, or did
I miss something?

3) Which filesystem is advisable, knowing that we're calculating on large
berkeley db databases?

thanks in advance,

toon
-------------- next part --------------
An HTML attachment was scrubbed...
URL: