From ebiederm at xmission.com Sat Jan 2 11:43:34 2010
From: ebiederm at xmission.com (Eric W. Biederman)
Date: Sat, 02 Jan 2010 11:43:34 -0800
Subject: [Beowulf] Performance tuning for Jumbo Frames
In-Reply-To: <20091216080118.GB8679@bx9.net> (Greg Lindahl's message of "Wed, 16 Dec 2009 00:01:18 -0800")
References: <4B286517.10009@myri.com> <20091216080118.GB8679@bx9.net>
Message-ID: 

Greg Lindahl writes:

> On Tue, Dec 15, 2009 at 11:41:59PM -0500, Patrick Geoffray wrote:
>
>> So, instead of requiring ~4K per port minimum, you need about ~20K per
>> port. Add to that up to 8 priorities with DCB and the buffering
>> requirements are quickly getting out of hand.
>
> Don't worry, switch vendors will simply implement it all poorly, just
> like InfiniBand. That's what always happens with overly-complicated
> QOS schemes.

The way I have heard per-priority flow control is expected to be used, one priority carries pause frames (for FCoE) while the other priorities drop frames; dropping frames when congested avoids head-of-line blocking, and thus gives better network performance overall.

Of course, in practice at 10G-type speeds you want enough switch buffers that you don't normally drop packets, but that is a different story.

Eric

From sdi at cs.hku.hk Sun Jan 3 08:13:56 2010
From: sdi at cs.hku.hk (Di sheng)
Date: Mon, 4 Jan 2010 00:13:56 +0800
Subject: [Beowulf] [hpc-announce] CCGrid 2010: CALL FOR POSTERS
Message-ID: <47e012df1001030813u1331402eg7233364171ef724a@mail.gmail.com>

---------------------------------------------------------------------------
We apologize if you receive multiple copies of this CFP
---------------------------------------------------------------------------

***********************************************************************
* CCGrid 2010: CALL FOR POSTERS
* Poster submission deadline extended to 18 January 2010
*
* The 10th IEEE/ACM International Symposium on
* Cluster, Cloud and Grid Computing (CCGrid 2010)
* May 17-20, 2010, Melbourne, Victoria, Australia
* URL: http://www.manjrasoft.com/ccgrid2010/
**********************************************************************

We invite participants to submit a poster to the CCGrid 2010 conference. CCGrid aims at presenting the latest breakthroughs in Cluster, Grid and Cloud technologies for both academic and industry professionals. The submission areas of interest match those of the conference, and include the following:

* Cluster technologies
* Grid Architectures and Systems
* Utility Computing Models for Clusters and Grids
* Grid Economies and Service Architectures
* Service Composition and Orchestration
* Middleware for Clusters and Grids
* Parallel and Wide-Area File Systems
* Peer-to-Peer Systems
* Cloud Computing
* Community and collaborative computing networks
* Grid Trust and Security
* Support for Autonomic Grid Infrastructure
* Resource Management
* Scheduling and Load Balancing
* Programming Models, Tools, and Environments
* Performance Evaluation and Modeling
* Grid-based Problem Solving Environments
* Scientific, Engineering, and Commercial Applications

Posters can be submitted in one of two ways:

1. Proceedings published posters: Participants submitting proceedings published posters are required to submit a 2-page short paper describing the poster content, research, relevance and importance to the cluster, grid and cloud computing community. If accepted, these 2-page short papers will be published in the proceedings of the conference.
2. Web published posters: These posters require a short 1-page abstract to be submitted. These abstracts will not be included in the conference proceedings, but will be published on the conference website.

For both forms of posters, participants will be able to display the poster during the conference and give a short presentation about it.

Important dates and guidelines for poster submission can be found below:

Deadline for proceedings published posters : Jan 18th
Notification of Acceptance : Feb 5th
Final Version Due : Feb 12th
Deadline for web published posters : Apr 1st

Two awards, (a) best poster and (b) best poster presentation, sponsored by ManjraSoft, will be presented.

Posters Committee:
=================
Ahmad Afsahi, Queen's University
David Bernholdt, Oak Ridge National Laboratory
Darius Buntinas, Argonne National Laboratory
Yong Chen, Illinois Institute of Technology
Mark Gardner, Virginia Tech
Torsten Hoefler, Indiana University
Hyun-wook Jin, Konkuk University
Nicholas Karonis, Northern Illinois University
Zhiling Lan, Illinois Institute of Technology
Jiuxing Liu, IBM Research
Scott Pakin, Los Alamos National Laboratory
Sayantan Sur, IBM Research
Abhinav Vishnu, Pacific Northwest National Laboratory
Venkatram Vishwanath, Argonne National Laboratory

From mdidomenico4 at gmail.com Mon Jan 4 17:54:48 2010
From: mdidomenico4 at gmail.com (Michael Di Domenico)
Date: Mon, 4 Jan 2010 20:54:48 -0500
Subject: [Beowulf] cisco networking
Message-ID: 

It's my understanding that you can only have one LACP/EtherChannel link between two Cisco switches, and that link can only comprise up to eight actual links, which using 10Gbps links would yield around ~8GB/sec.

Does anyone know if it's possible to get more? Between two Cisco 6500's?

I'm pretty sure the answer is no, but I figured I'd check...

From doseyg at r-networks.net Sat Jan 9 22:29:12 2010
From: doseyg at r-networks.net (Glen Dosey)
Date: Sun, 10 Jan 2010 01:29:12 -0500
Subject: [Beowulf] cisco networking
In-Reply-To: 
References: 
Message-ID: <1263104952.5838.37.camel@eclipse.office.r-networks.net>

You can have multiple ether-channel links between 2 switches. The limitation is that if they are in the same layer 2 broadcast domain (VLAN), only 1 will be active and STP will block the other.

If you place each switch in a separate network, you could use equal-cost load balancing across multiple point-to-point SVIs on ether-channel links. You'll increase bandwidth at the cost of a little additional latency. Of course, at the bandwidths you are talking about (160 Gbit/s or more) I have no idea how it would really work. Backplane bandwidth, hashing algorithms, and all sorts of other factors come into play and could cause it to fail completely. I'd love to try it out, but I can't help but think there is probably a better architectural solution.

What are you trying to do ?

On Mon, 2010-01-04 at 20:54 -0500, Michael Di Domenico wrote:
> It's my understanding that you can only have one LACP/EtherChannel link
> between two Cisco switches, and that link can only comprise up to eight
> actual links, which using 10Gbps links would yield around ~8GB/sec.
>
> Does anyone know if it's possible to get more? Between two Cisco 6500's?
>
> I'm pretty sure the answer is no, but I figured I'd check...
From mdidomenico4 at gmail.com Mon Jan 11 06:58:24 2010
From: mdidomenico4 at gmail.com (Michael Di Domenico)
Date: Mon, 11 Jan 2010 09:58:24 -0500
Subject: [Beowulf] cisco networking
In-Reply-To: <1263104952.5838.37.camel@eclipse.office.r-networks.net>
References: <1263104952.5838.37.camel@eclipse.office.r-networks.net>
Message-ID: 

On Sun, Jan 10, 2010 at 1:29 AM, Glen Dosey wrote:
> You can have multiple ether-channel links between 2 switches. The
> limitation is that if they are in the same layer 2 broadcast domain
> (VLAN), only 1 will be active and STP will block the other.
>
> If you place each switch in a separate network, you could use equal-cost
> load balancing across multiple point-to-point SVIs on ether-channel
> links. You'll increase bandwidth at the cost of a little additional
> latency. Of course, at the bandwidths you are talking about (160 Gbit/s
> or more) I have no idea how it would really work. Backplane bandwidth,
> hashing algorithms, and all sorts of other factors come into play and
> could cause it to fail completely. I'd love to try it out, but I can't
> help but think there is probably a better architectural solution.
>
> What are you trying to do ?

The attempt is really to get a bunch of devices which only have 1Gbps Ethernet connectivity to talk to a bunch of other machines with InfiniBand connectivity (i.e., compute nodes to storage). But we want the aggregate bandwidth to be over 5GB/s.

I was trying to see if there was a simple solution using some of the existing stuff we have, without having to get overly creative with the network or purchase 10G Ethernet to IB routers...

Looks like IB routers are the only solution that really works for me...

From Greg at keller.net Mon Jan 11 12:53:47 2010
From: Greg at keller.net (Greg Keller)
Date: Mon, 11 Jan 2010 14:53:47 -0600
Subject: [Beowulf] cisco networking
In-Reply-To: <201001112000.o0BK07NA020635@bluewest.scyld.com>
References: <201001112000.o0BK07NA020635@bluewest.scyld.com>
Message-ID: <8692F655-98DA-4E6D-849D-6FA7D5AC822D@Keller.net>

> From: Michael Di Domenico
> Subject: Re: [Beowulf] cisco networking
>
> On Sun, Jan 10, 2010 at 1:29 AM, Glen Dosey wrote:
>> You can have multiple ether-channel links between 2 switches. The
>> limitation is that if they are in the same layer 2 broadcast domain
>> (VLAN), only 1 will be active and STP will block the other.
>>
>> [...]
>>
>> What are you trying to do ?
> The attempt is really to get a bunch of devices which only have
> 1Gbps Ethernet connectivity to talk to a bunch of other machines with
> InfiniBand connectivity (i.e., compute nodes to storage). But we want
> the aggregate bandwidth to be over 5GB/s.
>
> I was trying to see if there was a simple solution using some of the
> existing stuff we have, without having to get overly creative with the
> network or purchase 10G Ethernet to IB routers...
>
> Looks like IB routers are the only solution that really works for me...

Remember that you can actually use nodes to route IP between GbE and IB, just don't expect them to run at IB wirespeed. This can at least save you from paying a premium for proprietary hardware in order to test functionality. Last I tried, 3-4 Gbit was about as fast as IP over IB would get through the interface, presumably due to IP overhead... but that was long ago, so if you try it I'd love to hear your results.

From jmdavis1 at vcu.edu Mon Jan 11 14:19:56 2010
From: jmdavis1 at vcu.edu (Mike Davis)
Date: Mon, 11 Jan 2010 17:19:56 -0500
Subject: [Beowulf] cisco networking
In-Reply-To: <8692F655-98DA-4E6D-849D-6FA7D5AC822D@Keller.net>
References: <201001112000.o0BK07NA020635@bluewest.scyld.com> <8692F655-98DA-4E6D-849D-6FA7D5AC822D@Keller.net>
Message-ID: <4B4BA40C.9090607@vcu.edu>

Greg Keller wrote:
>> The attempt is really to get a bunch of devices which only have
>> 1Gbps Ethernet connectivity to talk to a bunch of other machines with
>> InfiniBand connectivity (i.e., compute nodes to storage). But we want
>> the aggregate bandwidth to be over 5GB/s.
>>
>> Looks like IB routers are the only solution that really works for me...
>
> Remember that you can actually use nodes to route IP between GbE and IB,
> just don't expect them to run at IB wirespeed.

Cisco switches can be connected via trunked links. This allows you to use multiple links to increase the bandwidth. The limit used to be 3, but that may have changed. You can check the switch's online documentation for more information.
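For anyone wanting to try Greg's suggestion of pressing a dual-homed node into service as a GbE-to-IPoIB router, a minimal sketch follows. The interface names, addresses, and subnets are illustrative assumptions, not details from the thread:

    # On the router node (eth0 on the GbE side, ib0 running IPoIB):
    sysctl -w net.ipv4.ip_forward=1         # enable IP forwarding

    ip addr add 192.168.1.1/24 dev eth0     # GbE-only compute nodes live here
    ip addr add 10.0.0.1/24 dev ib0         # IB-attached storage lives here

    # On each GbE-only client, reach the storage subnet via the router node:
    ip route add 10.0.0.0/24 via 192.168.1.1

Since one such node tops out well below IB wirespeed (Greg saw 3-4 Gbit of IP-over-IB throughput), several router nodes can be run in parallel, with different groups of clients pointed at different routers, to approach an aggregate like the 5GB/s mentioned above.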
From forum.san at gmail.com Mon Jan 11 23:19:47 2010
From: forum.san at gmail.com (Sangamesh B)
Date: Tue, 12 Jan 2010 12:49:47 +0530
Subject: [Beowulf] Need some advice: Sun storage management server hangs repeatedly
Message-ID: 

Hi HPC experts,

I seek your advice/suggestions to resolve a storage (NAS) server's repeated hanging problem.

We have a 23-node Rocks 5.1 HPC cluster. The 12 TB Sun storage is connected to a Sun Fire X4150 management server running RHEL 5.3, and this server is connected to the Gigabit switch that provides the cluster's private network. The home directories on the cluster are NFS-mounted from the storage partitions across all nodes, including the master.

This server hangs repeatedly. As initial troubleshooting we installed Ganglia to check network utilization, but it looks normal. We're not sure how to troubleshoot and resolve the problem. Can anybody help us resolve this issue?

Thanks,
Sangamesh

From john.hearns at mclaren.com Tue Jan 12 01:47:18 2010
From: john.hearns at mclaren.com (Hearns, John)
Date: Tue, 12 Jan 2010 09:47:18 -0000
Subject: [Beowulf] Need some advice: Sun storage management server hangs repeatedly
In-Reply-To: 
References: 
Message-ID: <68A57CCFD4005646957BD2D18E60667B0ED811A1@milexchmb1.mil.tagmclarengroup.com>

> This server hangs repeatedly. As initial troubleshooting we installed
> Ganglia to check network utilization, but it looks normal. We're not
> sure how to troubleshoot and resolve the problem. Can anybody help us
> resolve this issue?

Some tools you need:

ping -f on the clients (ie flood pings)
tcpdump on both clients and server
nfsstat on both client and server
iostat on the server

Can you be a little bit clearer on what you mean by a "hang"?
Do you see messages about the NFS server in /var/log/messages on the client?
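To make John's list concrete, here is one plausible way to run those tools, assuming the NFS server sits at 192.168.1.10 and eth0 carries the cluster network (the address and interface name are assumptions):

    # On a client: flood-ping the server and watch for packet loss
    ping -f 192.168.1.10

    # On both client and server: watch NFS traffic on the wire
    tcpdump -i eth0 host 192.168.1.10 and port 2049

    # NFS call statistics, including retransmissions
    nfsstat -c        # on a client
    nfsstat -s        # on the server

    # On the server: per-device utilization at 5-second intervals
    iostat -x 5

    # On a client: look for "nfs: server ... not responding" messages
    grep -i nfs /var/log/messages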
From rpnabar at gmail.com Tue Jan 12 23:06:25 2010
From: rpnabar at gmail.com (Rahul Nabar)
Date: Wed, 13 Jan 2010 01:06:25 -0600
Subject: [Beowulf] are compute nodes always kept in a private IP and switch space?
Message-ID: 

I always took it as natural to keep all compute nodes on a private switch and to assign them local IP addresses. This was almost axiomatic for an HPC application in my mind. This way I can channel all traffic to the world, and all logins, through a select login node, then firewall the login nodes carefully.

Just today, though, on a new project the admin said he always keeps his compute nodes on public IPs and runs individual firewalls on them.

This seemed just so wrong to me in so many ways, but I was curious if there are legitimate reasons why people might do this? Just curious.

--
Rahul

From beat at 0x1b.ch Tue Jan 12 23:37:50 2010
From: beat at 0x1b.ch (Beat Rubischon)
Date: Wed, 13 Jan 2010 08:37:50 +0100
Subject: [Beowulf] are compute nodes always kept in a private IP and switch space?
In-Reply-To: 
Message-ID: 

Hello!

Quoting (13.01.10 08:06):

> This seemed just so wrong to me in so many ways, but I was curious if
> there are legitimate reasons why people might do this? Just curious.

I see both approaches, even though the private LAN is the more common solution.

There are applications that need interaction with a graphical frontend on the user's workstation. Other reasons are braindead license servers that are not NATable, like the ones used by Catia or LS-DYNA. And management can be much easier when the administrator is able to contact every device directly from his workstation.

Of course, none of those examples needs public IPs. A range of campus- or company-wide routed private IPs is good enough. Remember, 2010 is the last year in which IANA is able to provide IP space :-)

The private LAN has the big advantage of being a "protected zone", usually located in a locked datacenter. Exporting NFS or any kind of cluster filesystem to the whole subnet is much, much easier than using dedicated exports or netgroups for each node. Several cluster-related tools do not filter requests and are vulnerable to spoofing attacks; I mainly think of Ganglia or syslogd, which accept any UDP packet sent to them. Opening the cluster LAN always means additional effort to keep the system secure.

So both approaches make sense. It depends on your needs and your existing environment, and also on your experience in system and network security.

Beat

--
 \|/ Beat Rubischon
( 0^0 ) http://www.0x1b.ch/~beat/
oOO--(_)--OOo---------------------------------------------------
My experiences, thoughts and dreams: http://www.0x1b.ch/blog/

From eugen at leitl.org Wed Jan 13 01:37:41 2010
From: eugen at leitl.org (Eugen Leitl)
Date: Wed, 13 Jan 2010 10:37:41 +0100
Subject: [Beowulf] Sun/AMD HPC for Dummies ebook now as PDF
Message-ID: <20100113093741.GL17686@leitl.org>

http://www.sun.com/x64/ebooks/hpc.jsp

HPC for Dummies

HPC enables us to first model then manipulate products, services, and techniques. These days, HPC has moved from a selective and expensive endeavor to a cost-effective enabling technology within reach of virtually every budget. This book will help you to get a handle on exactly what HPC does and can be.

This special edition eBook from Sun and AMD shares details on real-world uses of HPC, explains the different types of HPC, guides you on how to choose between different suppliers, and provides benchmarks and guidelines you can use to get your system up and running.

https://dct.sun.com/dct/forms/reg_us_1808_222_0.jsp

--
Eugen* Leitl leitl http://leitl.org
______________________________________________________________
ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE

From tegner at renget.se Wed Jan 13 03:40:41 2010
From: tegner at renget.se (tegner at renget.se)
Date: Wed, 13 Jan 2010 12:40:41 +0100
Subject: [Beowulf] Parallel file systems
Message-ID: 

While starting to investigate different storage solutions I came across Gluster (www.gluster.com). I did a search on beowulf.org and came up with nothing; gpfs, pvfs and lustre, on the other hand, resulted in lots of hits. Anyone with experience of Gluster in HPC?

Regards,

/jon

From bob at drzyzgula.org Wed Jan 13 05:34:30 2010
From: bob at drzyzgula.org (Bob Drzyzgula)
Date: Wed, 13 Jan 2010 08:34:30 -0500
Subject: [Beowulf] Re: Sun/AMD HPC for Dummies ebook now as PDF
In-Reply-To: <20100113093741.GL17686@leitl.org>
References: <20100113093741.GL17686@leitl.org>
Message-ID: <20100113133430.GA27540@mx1.drzyzgula.org>

FYI, I downloaded this and it is 46 pages of cursory, high-level overview of the concept of HPC. It doesn't even have any Rich Tennant cartoons.
Just about anyone would get more out of spending an hour following threads out of the Wikipedia High-Performance Computing page. The exception might be a manager who doesn't know what the term "high-performance computing" means but at the same time somehow has a budget to go out and buy and staff a cluster -- that person might learn a little about where to start.

On the plus side, the registration form seems to accept just about any random crap, and didn't even make me check the mailinator email address I used to get the download link.

--Bob

On 13/01/10 10:37 +0100, Eugen Leitl wrote:
> http://www.sun.com/x64/ebooks/hpc.jsp
>
> HPC for Dummies
>
> HPC enables us to first model then manipulate products, services, and
> techniques. These days, HPC has moved from a selective and expensive
> endeavor to a cost-effective enabling technology within reach of
> virtually every budget. This book will help you to get a handle on
> exactly what HPC does and can be.
>
> https://dct.sun.com/dct/forms/reg_us_1808_222_0.jsp

From rpnabar at gmail.com Wed Jan 13 10:05:27 2010
From: rpnabar at gmail.com (Rahul Nabar)
Date: Wed, 13 Jan 2010 12:05:27 -0600
Subject: [Beowulf] are compute nodes always kept in a private IP and switch space?
In-Reply-To: 
References: 
Message-ID: 

On Wed, Jan 13, 2010 at 1:37 AM, Beat Rubischon wrote:
> on the user's workstation. Other reasons are braindead license servers
> that are not NATable, like the ones used by Catia or LS-DYNA. And management
> can be much easier when the administrator is able to contact every device
> directly from his workstation.

Thanks! Oh! I thought NAT worked transparently and the application didn't even realize it was NAT-ed. I didn't know some servers could have a problem with this.

--
Rahul

From hahn at mcmaster.ca Wed Jan 13 10:37:39 2010
From: hahn at mcmaster.ca (Mark Hahn)
Date: Wed, 13 Jan 2010 13:37:39 -0500 (EST)
Subject: [Beowulf] are compute nodes always kept in a private IP and switch space?
In-Reply-To: 
References: 
Message-ID: 

>> on the user's workstation. Other reasons are braindead license servers
>> that are not NATable, like the ones used by Catia or LS-DYNA. And management
>> can be much easier when the administrator is able to contact every device
>> directly from his workstation.

I don't agree with the latter at all; the marginal effort of admining through another box is trivial.

> Oh! I thought NAT worked transparently and the application didn't even
> realize it was NAT-ed. I didn't know some servers could have a problem
> with this.

A client using NAT will not know any different, but the _talked-to_ service might, since it'll see multiple connections from the same NAT server address(es), and won't be able to originate a socket to the client (unless the NATer has some protocol-specific awareness, like NATed FTP).
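For concreteness, the head-node masquerading setup being discussed usually boils down to a couple of lines; eth0 as the public interface and 192.168.0.0/24 as the private cluster range are assumptions here:

    # On the head node: forward and masquerade compute-node traffic
    sysctl -w net.ipv4.ip_forward=1
    iptables -t nat -A POSTROUTING -o eth0 -s 192.168.0.0/24 -j MASQUERADE

This is exactly the case where a license server that tries to open a connection back to a compute node fails: every node appears as the head node's address, and there is no mapping for the inbound socket unless a protocol-specific helper or an explicit port forward exists.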
We have all our compute nodes on private addresses and also disable NAT. This does make it somewhat trickier to get externally-hosted evil-type licenses to work (flexlm with vendor daemons), but I'd say this is a fairly useful dividing issue: clusters that are all-public tend to be personal-ish and small; clusters that are larger and support very wide groups tend to be more tightly controlled. I think the way to think of it is that if you have a personal or limited-purpose cluster, you _do_ in fact want it to depend on (and wait on) external resources (licenses, fileservers, GUI apps). For a large, broad-purpose cluster with lots of disparate users, it's very important to minimize those sources of complexity and inefficiency.

From joshua_mora at usa.net Wed Jan 13 09:08:31 2010
From: joshua_mora at usa.net (Joshua mora acosta)
Date: Wed, 13 Jan 2010 11:08:31 -0600
Subject: [Beowulf] Re: Sun/AMD HPC for Dummies ebook now as PDF
Message-ID: <788oamRHf7008S29.1263402511@cmsweb29.cms.usa.net>

Hi,

I think you misinterpreted the title. It is what it is: "HPC for Dummies". Enough to explain in a plain way to anyone what HPC is, and it may not be that easy to make a good summary of such a broad topic in 46 pages.

It would be great, though, to see a title like "HPC for the next decade" or "beyond HPC" that summarizes the ongoing investigations/challenges we are going to be facing. That summary of _balanced (politically correct)/broad (physics+hw+sw+business)_ ideas may need a similar level of effort to the effort done on "HPC for Dummies".

Regards,
Joshua

------ Original Message ------
Received: 07:45 AM CST, 01/13/2010
From: Bob Drzyzgula
To: Eugen Leitl
Cc: Beowulf at beowulf.org
Subject: [Beowulf] Re: Sun/AMD HPC for Dummies ebook now as PDF

> FYI, I downloaded this and it is 46 pages of cursory,
> high-level overview of the concept of HPC. It doesn't
> even have any Rich Tennant cartoons. Just about anyone
> would get more out of spending an hour following threads
> out of the Wikipedia High-Performance Computing page. The
> exception might be a manager who doesn't know what the
> term "high-performance computing" means but at the same
> time somehow has a budget to go out and buy and staff
> a cluster -- that person might learn a little about
> where to start.
>
> On the plus side, the registration form seems to accept
> just about any random crap, and didn't even make me check
> the mailinator email address I used to get the download
> link.
>
> --Bob
>
> On 13/01/10 10:37 +0100, Eugen Leitl wrote:
> >
> > http://www.sun.com/x64/ebooks/hpc.jsp
> >
> > HPC for Dummies
> >
> > HPC enables us to first model then manipulate products, services, and
> > techniques. These days, HPC has moved from a selective and expensive
> > endeavor to a cost-effective enabling technology within reach of
> > virtually every budget. This book will help you to get a handle on
> > exactly what HPC does and can be.
> >
> > https://dct.sun.com/dct/forms/reg_us_1808_222_0.jsp

From rpnabar at gmail.com Wed Jan 13 16:35:46 2010
From: rpnabar at gmail.com (Rahul Nabar)
Date: Wed, 13 Jan 2010 18:35:46 -0600
Subject: [Beowulf] are compute nodes always kept in a private IP and switch space?
In-Reply-To: 
References: 
Message-ID: 

On Wed, Jan 13, 2010 at 12:37 PM, Mark Hahn wrote:
> We have all our compute nodes on private addresses and also disable NAT.
> This does make it somewhat trickier to get externally-hosted evil-type
> licenses to work (flexlm with vendor daemons), but I'd say this is a fairly
> useful dividing issue: clusters that are all-public tend to be personal-ish
> and small; clusters that are larger and support very wide groups tend to be
> more tightly controlled. I think the way to think of it is that if you
> have a personal or limited-purpose cluster, you _do_ in fact want it to
> depend on (and wait on) external resources (licenses, fileservers, GUI apps).
> For a large, broad-purpose cluster with lots of disparate users, it's very
> important to minimize those sources of complexity and inefficiency.

Thanks Mark! Makes a lot of sense. Thanks for sharing your experiences!

--
Rahul

From skylar at cs.earlham.edu Wed Jan 13 20:58:34 2010
From: skylar at cs.earlham.edu (Skylar Thompson)
Date: Wed, 13 Jan 2010 20:58:34 -0800
Subject: [Beowulf] are compute nodes always kept in a private IP and switch space?
In-Reply-To: 
References: 
Message-ID: <4B4EA47A.2070500@cs.earlham.edu>

Rahul Nabar wrote:
> I always took it as natural to keep all compute nodes on a private
> switch and to assign them local IP addresses. This was almost
> axiomatic for an HPC application in my mind. This way I can channel
> all traffic to the world, and all logins, through a select login node,
> then firewall the login nodes carefully.
>
> Just today, though, on a new project the admin said he always keeps
> his compute nodes on public IPs and runs individual firewalls on them.
>
> This seemed just so wrong to me in so many ways, but I was curious if
> there are legitimate reasons why people might do this? Just curious.

I do everything I can to keep cluster nodes on a private network, with only the head node visible on the public network. One exception I've had to make is when storage is on a separate network. NAT doesn't do well with CIFS/NFS, so it's just easier giving the nodes fully-routable IP addresses.

--
-- Skylar Thompson (skylar at cs.earlham.edu)
-- http://www.cs.earlham.edu/~skylar/
From skylar at cs.earlham.edu Wed Jan 13 21:08:43 2010
From: skylar at cs.earlham.edu (Skylar Thompson)
Date: Wed, 13 Jan 2010 21:08:43 -0800
Subject: [Beowulf] Need some advice: Sun storage management server hangs repeatedly
In-Reply-To: 
References: 
Message-ID: <4B4EA6DB.8070207@cs.earlham.edu>

Sangamesh B wrote:
> Hi HPC experts,
>
> I seek your advice/suggestions to resolve a storage (NAS) server's
> repeated hanging problem.
>
> We have a 23-node Rocks 5.1 HPC cluster. The 12 TB Sun storage is
> connected to a Sun Fire X4150 management server running RHEL 5.3, and
> this server is connected to the Gigabit switch that provides the
> cluster's private network. The home directories on the cluster are
> NFS-mounted from the storage partitions across all nodes, including
> the master.
>
> This server hangs repeatedly. As initial troubleshooting we installed
> Ganglia to check network utilization, but it looks normal. We're not
> sure how to troubleshoot and resolve the problem. Can anybody help us
> resolve this issue?

Is there anything amiss according to the service processor?

--
-- Skylar Thompson (skylar at cs.earlham.edu)
-- http://www.cs.earlham.edu/~skylar/

From rpnabar at gmail.com Thu Jan 14 15:40:27 2010
From: rpnabar at gmail.com (Rahul Nabar)
Date: Thu, 14 Jan 2010 17:40:27 -0600
Subject: [Beowulf] running the Linpack-HPL benchmark
Message-ID: 

I've never had a cluster large enough to matter, but I was thinking of running the Linpack-HPL benchmark (from the top500 site) just out of curiosity, to know my actual teraflops. For one, it would tell me my Rmax/Rpeak ratio, so that I know how non-optimal my network and other infrastructure are.

Question: how difficult is it to get that benchmark to run? I was eager to hear opinions from sysadmins who've "been there, done that". If it is a horrendously difficult process I might just skip it; some of the tuning sections looked scary.

Are there any good pointers on tuning-parameter selection, or ready-made makefiles? I have Intel Nehalem processors and a regular Gigabit network.

--
Rahul

From gus at ldeo.columbia.edu Thu Jan 14 17:25:17 2010
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Thu, 14 Jan 2010 20:25:17 -0500
Subject: [Beowulf] running the Linpack-HPL benchmark
In-Reply-To: 
References: 
Message-ID: <4B4FC3FD.6040302@ldeo.columbia.edu>

Hi Rahul

It is a bit involved, but not very difficult, to set up HPL.

First get the Goto BLAS/LAPACK from TACC:
http://www.tacc.utexas.edu/tacc-projects/
Install it using the Gnu compilers.

Then get HPL from Netlib:
www.netlib.org/benchmark/hpl/
Tweak the Makefile to point to your MPI wrappers and to the Goto library. Build HPL.

Read the TUNING file that comes with HPL. It has important information about the input parameters. The main ones are N, P, and Q.
www.netlib.org/benchmark/hpl/tuning.html

First, to test, run HPL on a single node or a few nodes, using small values of N, say 1000 to 20000.

The maximum value of N can be approximated by Nmax = sqrt(0.8*Total_RAM_on_ALL_nodes_in_bytes/8). This uses all the RAM, but doesn't get into memory paging.
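Gus's formula is easy to evaluate in the shell; the node count and per-node RAM below are placeholders, not numbers from the thread. The 0.8 factor leaves 20% of memory for the OS and MPI buffers, and the division by 8 converts bytes to double-precision words:

    # Nmax = sqrt(0.8 * total_RAM_in_bytes / 8); round down to a multiple of NB
    NODES=16
    GB_PER_NODE=16
    echo "sqrt(0.8 * $NODES * $GB_PER_NODE * 1024^3 / 8)" | bc -l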
Then run HPL on the whole cluster with the Nmax above. Nmax pushes the envelope, and is where your best performance (Rmax/Rpeak) is likely to be reached. Try several P/Q combinations for Nmax (see the TUNING file).

I hope this helps,
Gus Correa

---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

Rahul Nabar wrote:
> I've never had a cluster large enough to matter, but I was thinking of
> running the Linpack-HPL benchmark (from the top500 site) just out of
> curiosity, to know my actual teraflops. For one, it would tell me
> my Rmax/Rpeak ratio, so that I know how non-optimal my network and
> other infrastructure are.
>
> Question: how difficult is it to get that benchmark to run? I was eager
> to hear opinions from sysadmins who've "been there, done that".
> If it is a horrendously difficult process I might just skip it; some
> of the tuning sections looked scary.
>
> Are there any good pointers on tuning-parameter selection, or ready-made
> makefiles? I have Intel Nehalem processors and a regular Gigabit
> network.

From walid.shaari at gmail.com Sat Jan 16 00:50:48 2010
From: walid.shaari at gmail.com (Walid)
Date: Sat, 16 Jan 2010 11:50:48 +0300
Subject: [Beowulf] HPC/mpi courses
Message-ID: 

Dear All,

Do you know of any official courses run in Europe or Asia covering HPC systems or development? MPI or new distributed-memory paradigms are welcome.

kind regards

Walid

From madskaddie at gmail.com Sat Jan 16 02:44:12 2010
From: madskaddie at gmail.com (madskaddie at gmail.com)
Date: Sat, 16 Jan 2010 10:44:12 +0000
Subject: [Beowulf] Gridengine and bash + Modules
Message-ID: 

Greetings,

I'm using gridengine (6.2u4, open source version) and I would like to use the Modules software. Modules uses a shell function that must be exported (bash: "export -f func_name") in order to set environment variables, but gridengine has a bug related to bash exported functions [1]. Is anybody using gridengine, bash and Modules? How can this be solved? Changing shells is not an option ;)

This issue is also being discussed here [2].

Thanks,
Gil

[1] - http://gridengine.sunsource.net/issues/show_bug.cgi?id=2173
[2] - http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&viewType=browseAll&dsMessageId=238562#messagefocus

--
" It can't continue forever. The nature of exponentials is that you push them out and eventually disaster happens. "
Gordon Moore (Intel co-founder and author of Moore's law)

From richard.walsh at comcast.net Sat Jan 16 05:06:55 2010
From: richard.walsh at comcast.net (richard.walsh at comcast.net)
Date: Sat, 16 Jan 2010 13:06:55 +0000 (UTC)
Subject: [Beowulf] HPC/mpi courses
In-Reply-To: 
Message-ID: <968564064.9709281263647215844.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net>

On Saturday, January 16, 2010 Khalid M. Issa wrote:

> Do you know of any official courses run in Europe or Asia covering
> HPC systems or development? MPI or new distributed-memory paradigms
> are welcome.

No, but I could send you a PDF of about 100 slides I put together for a side-by-side course I teach on CAF and UPC (for your use only). I also have an MPI intro course, but cannot send that (copyrighted). Finally, I recommend looking at the US National Lab sites (LLNL in particular), which have excellent OpenMP and MPI tutorials.
Regards,

Richard Walsh
Principal, Thrashing River Computing, and
Parallel Applications and Systems Manager, CUNY HPC Center

From jcownie at cantab.net Sat Jan 16 06:14:24 2010
From: jcownie at cantab.net (James Cownie)
Date: Sat, 16 Jan 2010 14:14:24 +0000
Subject: [Beowulf] HPC/mpi courses
In-Reply-To: 
References: 
Message-ID: <70EC0BEE-F6FC-422D-BE96-39A9FE74E9C1@cantab.net>

On 16 Jan 2010, at 08:50, Walid wrote:

> Dear All,
>
> Do you know of any official courses run in Europe or Asia covering
> HPC systems or development? MPI or new distributed-memory paradigms
> are welcome.

https://fs.hlrs.de/projects/par/events/2010/parallel_prog_2010/

There are no doubt many others.

--
-- Jim
--
James Cownie

From rpnabar at gmail.com Sat Jan 16 19:02:35 2010
From: rpnabar at gmail.com (Rahul Nabar)
Date: Sat, 16 Jan 2010 21:02:35 -0600
Subject: [Beowulf] running the Linpack-HPL benchmark
In-Reply-To: <4B4FC3FD.6040302@ldeo.columbia.edu>
References: <4B4FC3FD.6040302@ldeo.columbia.edu>
Message-ID: 

On Thu, Jan 14, 2010 at 7:25 PM, Gus Correa wrote:
> First, to test, run HPL on a single node or a few nodes,
> using small values of N, say 1000 to 20000.
>
> The maximum value of N can be approximated by
> Nmax = sqrt(0.8*Total_RAM_on_ALL_nodes_in_bytes/8).
> This uses all the RAM, but doesn't get into memory paging.
>
> Then run HPL on the whole cluster with the Nmax above.
> Nmax pushes the envelope, and is where your
> best performance (Rmax/Rpeak) is likely to be reached.
> Try several P/Q combinations for Nmax (see the TUNING file).

Thanks Gus! That helps a lot. I have Linpack running now on just a single server and am trying to tune it and hit the Rpeak.

I'm getting 62 Gflops, but I think my peak should be around 72 (2.26 GHz, 8-core Nehalem). On a single-server test do you manage to hit the theoretical peak? What's a good Rmax/Rpeak to shoot for while tuning?

Once I am confident I'm well tuned on one server, I'll try to extend it to the whole cluster.

--
Rahul

From hearnsj at googlemail.com Sun Jan 17 01:01:20 2010
From: hearnsj at googlemail.com (John Hearns)
Date: Sun, 17 Jan 2010 09:01:20 +0000
Subject: [Beowulf] Jobs in the UK
Message-ID: <9f8092cc1001170101m40e9f0ebm4b51d8c42927b743@mail.gmail.com>

If anyone is looking for a job in the UK, there are a few on offer:

http://www.jobserve.com/Systems-Specialist-Abingdon-Abingdon-Oxfordshire-Permanent-W6BA2DC924C566BD1.jsjob
(I think you can work out who this is.... if not I can tell you off list!)

http://www.jobserve.com/High-Performance-Computing-Grid-Services-Systems-Manager-Administrator-Kingston-upon-Thames-Surrey-Permanent-WE3A3B085349F371A.jsjob
(I have installed clusters in both places.)

There was an HPC job with a company in Oxfordshire, but it is no longer listed on Jobserve.
From robh at dongle.org.uk Sun Jan 17 03:24:48 2010
From: robh at dongle.org.uk (Rob Horton)
Date: Sun, 17 Jan 2010 11:24:48 +0000
Subject: [Beowulf] HPC/mpi courses
In-Reply-To: 
References: 
Message-ID: <20100117112448.GA1181@wyddfa.dongle.org.uk>

On Sat, Jan 16, 2010 at 11:50:48AM +0300, Walid wrote:
> Dear All,
>
> Do you know of any official courses run in Europe or Asia covering
> HPC systems or development? MPI or new distributed-memory paradigms
> are welcome.

NAG run various courses on behalf of HECToR in the UK:
http://www.hector.ac.uk/cse/training/
I'm not sure what the access arrangements are if your work isn't covered by one of the UK research councils.

Rob

From rpnabar at gmail.com Sun Jan 17 13:07:10 2010
From: rpnabar at gmail.com (Rahul Nabar)
Date: Sun, 17 Jan 2010 15:07:10 -0600
Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping
Message-ID: 

If I have the option of hardware RAID versus software RAID via mdadm, is there a clear winner in terms of performance? Or is the answer only resolvable by actual testing? I have a fairly fast machine (Nehalem 2.26 GHz, 8 cores) and 48 gigs of RAM.

Should I be using the vendor's hardware RAID or mdadm? In case a generic answer is not possible, what might be a good way to test the two options? Any other implications that I should be thinking about?

Finally, there's always a hybrid approach. I could have several small RAID5s at the hardware level (RAID5 seems OK since I have smaller disks, ~300 GB, so not really in the domain where the RAID6 arguments kick in, I think). Then using LVM I can integrate storage while asking LVM to stripe across these RAID5s. Thus I'd get striping at two levels: LVM (software) and RAID5 (hardware).

--
Rahul

From bill at cse.ucdavis.edu Sun Jan 17 14:51:27 2010
From: bill at cse.ucdavis.edu (Bill Broadley)
Date: Sun, 17 Jan 2010 14:51:27 -0800
Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping
In-Reply-To: 
References: 
Message-ID: <4B53946F.4080008@cse.ucdavis.edu>

Rahul Nabar wrote:
> If I have the option of hardware RAID versus software RAID via mdadm,
> is there a clear winner in terms of performance?

No.

> Or is the answer only resolvable by actual testing? I have a fairly fast
> machine (Nehalem 2.26 GHz, 8 cores) and 48 gigs of RAM.
>
> Should I be using the vendor's hardware RAID or mdadm?

Depends. Are you performance limited? If not, I'd go with the one that makes the admin happier. I prefer software RAID because I'm familiar with the interface, monitoring, and recovery. Not to mention that the flexibility of setting RAID per partition, and the ability to easily move RAIDs between machines, is valuable to me. With hardware RAID I'd want a spare controller around in case of failure.

Be warned that various hardware RAID companies seem to be making it harder to do software RAID. I've seen controllers do nasty things like not boot with 16 JBOD devices.

> In case a generic answer is not possible, what might be a good way to
> test the two options?

Ideally you would use your performance-limited application as a benchmark. Sure, there are plenty of micro-benchmarks to help: things like bonnie++, iozone, or postmark for a variety of random and sequential workloads. Postmark in particular can simulate a wide range of workloads. The ultimate benchmark is your particular usage.
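The micro-benchmarks Bill names can be driven along these lines; the target directory and data size here are assumptions, and the data size should be at least 2x RAM so the page cache doesn't flatter the numbers:

    # bonnie++: sequential throughput plus seek and file-creation tests
    bonnie++ -d /mnt/raid -s 96g -u nobody

    # iozone: automated sweep of read/write tests up to 96 GB files
    iozone -a -g 96G -f /mnt/raid/testfile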
> Any other implications that I should be thinking about?

Learning the GUI, bugs, and interface of a hardware RAID requires time and energy. I'd look at the failure modes you are trying to protect against. As far as reliability goes, I like smartctl, lm_sensors, RAID scrubbing, and the like to work so I can keep an eye on the disks' health.

> Finally, there's always a hybrid approach. I could have several small
> RAID5s at the hardware level (RAID5 seems OK since I have smaller
> disks, ~300 GB, so not really in the domain where the RAID6 arguments
> kick in, I think). Then using LVM I can integrate storage while asking
> LVM to stripe across these RAID5s. Thus I'd get striping at two
> levels: LVM (software) and RAID5 (hardware).

For a given level of reliability, size does affect the maximum number of disks in a RAID. Double-disk failures do happen. I'd hesitate to spread a file system across multiple RAIDs just because of recovery performance issues; after all, a file system spread across multiple RAIDs effectively reduces the entire group of disks to a single head.

For 16 disks I often use three 5-disk RAID5s and one global spare. If I run 3 workloads, one per 5 disks, I get dramatically better performance than 3 workloads running on 15 disks. So most issues (slow performance during backups, failures, disk full, migration, etc.) only affect 5 disks at a time, and could even be migrated to different machines.

From ljdursi at scinet.utoronto.ca Sun Jan 17 15:07:45 2010
From: ljdursi at scinet.utoronto.ca (Jonathan Dursi)
Date: Sun, 17 Jan 2010 18:07:45 -0500
Subject: [Beowulf] HPC/mpi courses
In-Reply-To: <20100117112448.GA1181@wyddfa.dongle.org.uk>
References: <20100117112448.GA1181@wyddfa.dongle.org.uk>
Message-ID: <0171F3F7-001B-4E43-B413-F3DE2A7F6054@scinet.utoronto.ca>

On 2010-01-17, at 6:24AM, Rob Horton wrote:
> On Sat, Jan 16, 2010 at 11:50:48AM +0300, Walid wrote:
>>
>> Do you know of any official courses run in Europe or Asia covering
>> HPC systems or development? MPI or new distributed-memory paradigms
>> are welcome.
>
> NAG run various courses on behalf of HECToR in the UK:
> http://www.hector.ac.uk/cse/training/

We have videos and slides up of a week-long MPI/OpenMP course we teach at SciNet at the University of Toronto:
http://www.cita.utoronto.ca/~ljdursi/PSP/
Videos online are no substitute for being in the classroom yourself, of course, but it's better than nothing.

Along those lines, does anyone have a good HPC / parallel computing textbook to get users started? There are (say) passable books on MPI, or OpenMP, or even on the Intel Thread Building Blocks stuff, but very little that I can find that integrates everything. Similarly with performance issues: O'Reilly used to have a pretty solid little book on HPC which was very nice for teaching people to think about serial optimization, but the last edition was 1998 and I can't find anything comparable.

- Jonathan

--
Jonathan Dursi

From a.travis at abdn.ac.uk Sun Jan 17 15:08:51 2010
From: a.travis at abdn.ac.uk (Tony Travis)
Date: Sun, 17 Jan 2010 17:08:51 -0600
Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping
In-Reply-To: 
References: 
Message-ID: <4B539883.8040003@abdn.ac.uk>

Rahul Nabar wrote:
> If I have the option of hardware RAID versus software RAID via mdadm,
> is there a clear winner in terms of performance? Or is the answer only
> resolvable by actual testing? I have a fairly fast machine (Nehalem
> 2.26 GHz, 8 cores) and 48 gigs of RAM.

Hello, Rahul.

It depends which level of RAID you want to use, and whether you want hot-swap capability. I use inexpensive 3ware 8006-2 RAID1 controllers and stripe them using "md" software RAID0 to make RAID10 arrays. This gives me good performance and hot-swap capability (the production md RAID driver does not support hot-swap).
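A sketch of Tony's layering, assuming the two 3ware RAID1 units appear to the kernel as /dev/sda and /dev/sdb (the device names, chunk size, and filesystem are assumptions):

    # Stripe two hardware RAID1 units into one md RAID0, i.e. a net RAID10
    mdadm --create /dev/md0 --level=0 --raid-devices=2 --chunk=256 \
          /dev/sda /dev/sdb

    mkfs.ext3 /dev/md0                         # or your filesystem of choice
    mdadm --detail --scan >> /etc/mdadm.conf   # persist the array definition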
However, where "md" really scores is portability. My RAID1s can only be read by 3ware controllers - I made a considered decision about this: the 3ware controllers are well supported by Linux kernels, but it makes me uneasy using a proprietary RAID format. I do also use "md" RAID5, which is more space-efficient, but read this:

http://www.baarf.com/

> Should I be using the vendor's hardware RAID or mdadm? In case a
> generic answer is not possible, what might be a good way to test the
> two options? Any other implications that I should be thinking about?

In fact, "mdadm" is just the user-space command for controlling the "md" driver. The problem with using an on-board RAID controller is that many of these are 'host' RAID (i.e. they need a Windows driver to do the RAID), in which case you are using the CPU anyway, and they also use proprietary formats. Generally, I just use SATA mode on the on-board RAID controller and create an "md" RAID. This means that I can replace a motherboard without worrying whether it has the same type of RAID controller on-board.

> Finally, there's always a hybrid approach. I could have several small
> RAID5s at the hardware level (RAID5 seems OK since I have smaller
> disks, ~300 GB, so not really in the domain where the RAID6 arguments
> kick in, I think). Then using LVM I can integrate storage while asking
> LVM to stripe across these RAID5s. Thus I'd get striping at two
> levels: LVM (software) and RAID5 (hardware).

Yes, I think a hybrid approach is good, because that's what I use ;-)

However, I would avoid relying on LVM mirroring for data protection. It is much safer to stripe a set of RAID1s using LVM. I don't think LVM is useful unless you are managing a disk farm. The commonest issue in disk performance is decoupling seeks between different spindles, so I put the system files on a different RAID1 set from the /export (or /home) filesystems.

HTH,
Tony.

--
Dr. A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition
and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK
tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk
mailto:a.travis at abdn.ac.uk, http://bioinformatics.rri.sari.ac.uk/~ajt

From gus at ldeo.columbia.edu Sun Jan 17 18:03:25 2010
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Sun, 17 Jan 2010 21:03:25 -0500
Subject: [Beowulf] running the Linpack-HPL benchmark
In-Reply-To: 
References: <4B4FC3FD.6040302@ldeo.columbia.edu>
Message-ID: 

Hi Rahul

I've got Rmax/Rpeak around 84% on the cluster (AMD Opteron Shanghai, IB on a single switch). I didn't have the cluster available to play with HPL for too long, so not too much tuning; I had to move to production mode. Some folks on mailing lists said they'd get 90%, but the topmost groups in the Top500 get less (as of mid-2009 it was ~75%, IIRC), probably because of their big networks with stacked switches and communication overhead.

To optimize on a single node, apply the Nmax formula as well, using the node's RAM. P and Q (the block-matrix decomposition) tend to be optimal when they are close to each other.
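For reference, these knobs live in HPL.dat; the excerpt below shows only the relevant lines for a hypothetical single 8-core node with 48 GB of RAM (the rest of the stock file stays as shipped, and the values are assumptions to be tuned, not recommendations from the thread):

    71616        Ns     (N from the Nmax formula, rounded down to a multiple of NB)
    1            # of NBs
    192          NBs    (try a few values, e.g. 128/168/192/224)
    1            # of process grids (P x Q)
    2            Ps     (P <= Q, grid as close to square as possible)
    4            Qs     (P x Q = number of MPI ranks, here 8 cores)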
With Nehalem you may have to consider the extra complexity of symmetric multi-threading (hyperthreading), and whether or not it makes a difference on very regular problems like HPL, with big loops and not much branching/ifs. (Your real-world computational chemistry problems are probably not like that.) Have you tried HPL with and without SMT/hyperthreading? It may be worth testing on a single node at least.

I hope this helps.
Gus Correa

On Jan 16, 2010, at 10:02 PM, Rahul Nabar wrote:

> On Thu, Jan 14, 2010 at 7:25 PM, Gus Correa wrote:
>
>> First, to test, run HPL on a single node or a few nodes,
>> using small values of N, say 1000 to 20000.
>>
>> The maximum value of N can be approximated by
>> Nmax = sqrt(0.8*Total_RAM_on_ALL_nodes_in_bytes/8).
>> This uses all the RAM, but doesn't get into memory paging.
>>
>> Then run HPL on the whole cluster with the Nmax above.
>> Nmax pushes the envelope, and is where your
>> best performance (Rmax/Rpeak) is likely to be reached.
>> Try several P/Q combinations for Nmax (see the TUNING file).
>
> Thanks Gus! That helps a lot. I have Linpack running now on just a
> single server and am trying to tune it and hit the Rpeak.
>
> I'm getting 62 Gflops, but I think my peak should be around 72 (2.26
> GHz, 8-core Nehalem). On a single-server test do you manage to hit the
> theoretical peak? What's a good Rmax/Rpeak to shoot for while tuning?
>
> Once I am confident I'm well tuned on one server, I'll try to extend
> it to the whole cluster.

From alscheinine at tuffmail.us Sun Jan 17 19:13:28 2010
From: alscheinine at tuffmail.us (Alan Louis Scheinine)
Date: Sun, 17 Jan 2010 21:13:28 -0600
Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping
In-Reply-To: <4B539883.8040003@abdn.ac.uk>
References: <4B539883.8040003@abdn.ac.uk>
Message-ID: <4B53D1D8.1060909@tuffmail.us>

I had a nightmare problem with a newly compiled kernel not booting. The problem may have been with the mkblkdevs command of nash, but in any case I did an extensive web search on the AHCI controller that I had in both a notebook and a desktop computer, both of which had the booting problem.

One example of the controller is the Intel ICH9M-E/M SATA AHCI Controller. In web postings there were many problems attributed to this controller. It does RAID 1 using software, and the Linux driver was (is?) a "work in progress", from what I gather reading web postings.

Dr. A.J. Travis wrote:
> The problem with using an on-board RAID controller is that many of
> these are 'host' RAID (i.e. they need a Windows driver to do the RAID),
> in which case you are using the CPU anyway, and they also use
> proprietary formats.

My point is to underline this fact. Hardware RAID 1 should be simple and reliable, but when the RAID controller relies on software running on the OS, then it might be better to use Linux software RAID.

Alan

--
Alan Scheinine
200 Georgann Dr., Apt. E6
Vicksburg, MS 39180
Email: alscheinine at tuffmail.us
Mobile phone: 225 288 4176

From landman at scalableinformatics.com Sun Jan 17 19:36:43 2010
From: landman at scalableinformatics.com (Joe Landman)
Date: Sun, 17 Jan 2010 22:36:43 -0500
Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping
In-Reply-To: 
References: 
Message-ID: <4B53D74B.9000301@scalableinformatics.com>

Rahul Nabar wrote:
> If I have the option of hardware RAID versus software RAID via mdadm,
> is there a clear winner in terms of performance? Or is

Depends upon workload: writes vs reads, streaming vs random IO, number of simultaneous readers/writers. There is no real clear answer.

> the answer only resolvable by actual testing? I have a fairly fast
> machine (Nehalem 2.26 GHz, 8 cores) and 48 gigs of RAM.

Testing is a good thing. Sadly, too many people test *after* they've purchased something (only to discover what is meant by the term "marketing benchmark numbers").

> Should I be using the vendor's hardware RAID or mdadm? In case a

Ohhh ... it depends. Some of the "vendors'" hardware RAID ... heck ... most of it ... is rebadged LSI gear, usually their lower-end stuff, which is sometimes fake-raid. Use fake-raid only if no other options exist.
> Should I be using the vendor's hardware RAID or mdadm? In case a Ohhh ... it depends. Some of the "vendors" hardware raid ... heck ... most of it ... is rebadged LSI gear. Usually their lower end stuff which is sometimes fake-raid. Use fake-raid only if no other options exist. More in a moment. > generic answer is not possible, what might be a good way to test the > two options? Any other implications that I should be thinking about? Benchmark your load using a load generator like fio. > > Finally, there;s always hybrid approaches. I could have several small > RAID5's at the hardware level (RIAD5 seems ok since I have smaller > disks ~300 GB so not really in the domain where the RAID6 arguments > kick in, I think) Then using LVM I can integrate storage while asking RAID6 kicks in purely from the second correlated disk failure scenario. This is size independent. It happens, and you need to be prepared. > LVM to stripe across these RAID5's. Thus I'd get striping at two > levels: LVM (software) and RAID5 (hardware). LVM is not a performance tool. Use it to help you manage things, not speed things. Our own testing puts our 24 bay DV4 unit at a bit more than 1GB/s sustained read (large block sequential) in RAID6, with writes in the 400-500 MB/s region (large block sequential). This is MD RAID based. Our "equivalent" JR4 system clocks in at nearly double the read speed, and about 3+ x the write speed. This is a hardware RAID system. Your mileage will vary ... tremendously ... as a function of your IO pattern. My own suggestion is to test before you buy. After you buy, well, its a bit harder to change your mind. Joe -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From rpnabar at gmail.com Sun Jan 17 19:53:04 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Sun, 17 Jan 2010 21:53:04 -0600 Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping In-Reply-To: <4B53D74B.9000301@scalableinformatics.com> References: <4B53D74B.9000301@scalableinformatics.com> Message-ID: On Sun, Jan 17, 2010 at 9:36 PM, Joe Landman wrote: > > Ohhh ... it depends. ?Some of the "vendors" hardware raid ... heck ... most > of it ... is rebadged LSI gear. ?Usually their lower end stuff which is > sometimes fake-raid. ?Use fake-raid only if no other options exist. Thanks Joe! What's "fake RAID"? Just a bad implementation or...........? > LVM is not a performance tool. ?Use it to help you manage things, not speed > things. I had thought so. But why then does LVM have features like striping if not for performance? Or are they just not so good? -- Rahul From rpnabar at gmail.com Sun Jan 17 19:55:57 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Sun, 17 Jan 2010 21:55:57 -0600 Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping In-Reply-To: <4B539883.8040003@abdn.ac.uk> References: <4B539883.8040003@abdn.ac.uk> Message-ID: Thanks Tony for the helpful tips! On Sun, Jan 17, 2010 at 5:08 PM, Tony Travis wrote: > > However, I would avoid relying on LVM mirroring for data protection. It is > much safer to stripe a set of RAID1's using LVM. I don't think LVM is useful > unless you are managing a disk farm. The commonest issue in disk perfomance > is decoupling seeks between different spindles, so I put the system files on > a different RAID1-set to /export (or /home) filesystems. 
My problem is that I have several different "storage boxes" each running on RAID5. But I use LVM to aggregate this storage. While doing this I noticed that LVM offers striping too. That's what got me thinking....

--
Rahul

From landman at scalableinformatics.com Sun Jan 17 19:58:23 2010
From: landman at scalableinformatics.com (Joe Landman)
Date: Sun, 17 Jan 2010 22:58:23 -0500
Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping
In-Reply-To:
References: <4B53D74B.9000301@scalableinformatics.com>
Message-ID: <4B53DC5F.3040601@scalableinformatics.com>

Rahul Nabar wrote:
> On Sun, Jan 17, 2010 at 9:36 PM, Joe Landman
> wrote:
>> Ohhh ... it depends. Some of the "vendors" hardware raid ... heck ... most
>> of it ... is rebadged LSI gear. Usually their lower end stuff which is
>> sometimes fake-raid. Use fake-raid only if no other options exist.
>
> Thanks Joe! What's "fake RAID"? Just a bad implementation or...........?

Its a software RAID implementation pretending to be a hardware RAID implementation. They are rarely if ever as good as MD. Many of them in Linux will invoke dm (the "other" RAID engine) as dm has "support" for fake-raid. Note that we have lost data (multiple times) with dm+fake-raid in testing, so we don't recommend its use in important machines (ones which you can't afford to lose). This could be due to bad drivers for the chips in question, but we aren't taking chances.

>> LVM is not a performance tool. Use it to help you manage things, not speed
>> things.
>
> I had thought so. But why then does LVM have features like striping if
> not for performance? Or are they just not so good?

LVM doesn't perform as well as MD RAID for performance. You can use it, just be advised that you are leaving a great deal of performance on the table if you do so.

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
       http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

From rpnabar at gmail.com Sun Jan 17 20:01:50 2010
From: rpnabar at gmail.com (Rahul Nabar)
Date: Sun, 17 Jan 2010 22:01:50 -0600
Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping
In-Reply-To: <4B53DC5F.3040601@scalableinformatics.com>
References: <4B53D74B.9000301@scalableinformatics.com> <4B53DC5F.3040601@scalableinformatics.com>
Message-ID:

On Sun, Jan 17, 2010 at 9:58 PM, Joe Landman wrote:
> Its a software RAID implementation pretending to be a hardware RAID
> implementation. They are rarely if ever as good as MD. Many of them in
> Linux will invoke dm (the "other" RAID engine) as dm has "support" for
> fake-raid. Note that we have lost data (multiple times) with dm+fake-raid
> in testing, so we don't recommend its use in important machines (ones which
> you can't afford to lose). This could be due to bad drivers for the chips
> in question, but we aren't taking chances.

Ah! Thanks for the tip. I thought the line between hardware and software was clear cut. Any way to oust such an impostor RAID? What signs would it show? If the hardware in fact does use system software and CPU I can find out by looking for a specific daemon etc? Or maybe CPU loads?
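[A few generic checks that will usually unmask a fake-raid controller; a sketch, not a definitive procedure, since output and module names vary by distro and chipset:]

   lspci | grep -i raid       # fake-raid commonly shows up as a plain SATA/AHCI chip
   cat /proc/mdstat           # any md software RAID arrays
   dmraid -r                  # lists disks carrying BIOS ("fake") RAID metadata
   lsmod | grep -E 'ahci|sata_|dm_'   # generic SATA + dm modules rather than a vendor driver

There is usually no daemon to look for as such: a real hardware controller presents one logical disk and is managed through its own utility (e.g. MegaCli for LSI or tw_cli for 3ware), while fake-raid leaves the parity work on the host CPU.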
--
Rahul

From gerry.creager at tamu.edu Sun Jan 17 20:24:51 2010
From: gerry.creager at tamu.edu (Gerald Creager)
Date: Sun, 17 Jan 2010 22:24:51 -0600
Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping
In-Reply-To: <4B53D1D8.1060909@tuffmail.us>
References: <4B539883.8040003@abdn.ac.uk> <4B53D1D8.1060909@tuffmail.us>
Message-ID: <4B53E293.5000407@tamu.edu>

Hardware RAID, in my experience, works well with most LSI controllers (that haven't been modified to become Dell PERC controllers) and 3Ware controllers. I've had pretty grim results with most others. A colleague and I had great initial results with several ARECA controllers, but then they lost their minds and did strange things to our RAID'd volumes. Different Linux distros and no operator intervention at the times of failure, either.

Hardware RAID should use a real onboard controller and be real hardware RAID. A lot of 'em use a "driver" which relies on the OS to actually do the RAID but with some proprietary bits that I don't know and can't see when they break. In this case I'd rather use MD s/w RAID. That said, if I can use a current 3Ware or LSI card (note caveat above; I've not had good performance with any PERC's) I'd rather do hardware RAID for simplicity and recoverability.

But, you have to do your own due diligence and know your hardware.

gerry

Alan Louis Scheinine wrote:
> I had a nightmare problem with a newly compiled kernel not booting.
> The problem may have been with the command mkblkdevs of nash but in
> any case I did extensive web search on the AHCI controller that I had
> in both a notebook and a desktop computer, both of which had the
> booting problem.
>
> One example of the controller is
> Intel ICH9M-E/M SATA AHCI Controller
>
> In web postings there were many problems attributed to this controller.
> It does RAID 1 using software. The Linux driver was (is?) a "work in
> progress" from what I gather reading web postings.
>
> Dr. A.J. Travis wrote:
>> The problem with using an on-board RAID controller is that many of
>> these are 'host' RAID (i.e. need a Windows driver to do the RAID)
>> in which case you are using the CPU anyway, and they also use
>> proprietary formats.
>
> My point is to underline this fact. Hardware RAID 1 should be simple and
> reliable. But when the RAID controller relies on software running on the
> O/S, then it might be better to use Linux software RAID.
>
> Alan

--
Gerry Creager -- gerry.creager at tamu.edu
Texas Mesonet -- AATLT, Texas A&M University
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843

From rpnabar at gmail.com Sun Jan 17 20:46:21 2010
From: rpnabar at gmail.com (Rahul Nabar)
Date: Sun, 17 Jan 2010 22:46:21 -0600
Subject: [Beowulf] running the Linpak -HPL benchmark.
In-Reply-To:
References: <4B4FC3FD.6040302@ldeo.columbia.edu>
Message-ID:

On Sun, Jan 17, 2010 at 8:03 PM, Gustavo Correa wrote:
> I've got Rmax/Rpeak around 84% on the cluster (AMD Opteron Shanghai, IB on a single switch).
> I didn't have the cluster available to play with HPL for too long, not too much tuning,
> I had to move to production mode.
> Some folks on mailing lists said they'd get 90%, but the topmost group in Top500 get less
> (as of mid-2009 it was ~75%, IIRR), probably because of their big networks
> with stacked switches and communication overhead.

Thanks Gus! After further tuning I am at 95% on a single node. But performance is falling drastically when I go multinode.
Maybe because I only have a 1 GigE network right now.

--
Rahul

From gerry.creager at tamu.edu Sun Jan 17 20:54:52 2010
From: gerry.creager at tamu.edu (Gerald Creager)
Date: Sun, 17 Jan 2010 22:54:52 -0600
Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping
In-Reply-To: <4B53DC5F.3040601@scalableinformatics.com>
References: <4B53D74B.9000301@scalableinformatics.com> <4B53DC5F.3040601@scalableinformatics.com>
Message-ID: <4B53E99C.6080905@tamu.edu>

+1: Reality.

Joe Landman wrote:
> Rahul Nabar wrote:
>> On Sun, Jan 17, 2010 at 9:36 PM, Joe Landman
>> wrote:
>>> Ohhh ... it depends. Some of the "vendors" hardware raid ... heck
>>> ... most
>>> of it ... is rebadged LSI gear. Usually their lower end stuff which is
>>> sometimes fake-raid. Use fake-raid only if no other options exist.
>>
>> Thanks Joe! What's "fake RAID"? Just a bad implementation or...........?
>
> Its a software RAID implementation pretending to be a hardware RAID
> implementation. They are rarely if ever as good as MD. Many of them in
> Linux will invoke dm (the "other" RAID engine) as dm has "support" for
> fake-raid. Note that we have lost data (multiple times) with
> dm+fake-raid in testing, so we don't recommend its use in important
> machines (ones which you can't afford to lose). This could be due to
> bad drivers for the chips in question, but we aren't taking chances.
>
>>> LVM is not a performance tool. Use it to help you manage things, not
>>> speed
>>> things.
>>
>> I had thought so. But why then does LVM have features like striping if
>> not for performance? Or are they just not so good?
>
> LVM doesn't perform as well as MD RAID for performance. You can use it,
> just be advised that you are leaving a great deal of performance on the
> table if you do so.

--
Gerry Creager -- gerry.creager at tamu.edu
Texas Mesonet -- AATLT, Texas A&M University
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843

From forum.san at gmail.com Mon Jan 18 01:13:05 2010
From: forum.san at gmail.com (Sangamesh B)
Date: Mon, 18 Jan 2010 14:43:05 +0530
Subject: [Beowulf] Need some advise: Sun storage' management server hangs repeatedly
In-Reply-To: <4B4EA6DB.8070207@cs.earlham.edu>
References: <4B4EA6DB.8070207@cs.earlham.edu>
Message-ID:

Hello all,

Thanks for your suggestions. But we lost access to the cluster because of the delay. But I got useful information to debug next time.

Thanks,
Sangamesh

On Thu, Jan 14, 2010 at 10:38 AM, Skylar Thompson wrote:
> Sangamesh B wrote:
> > Hi HPC experts,
> >
> > I seek your advice/suggestions to resolve a storage (NAS) server's
> > repeated hanging problem.
> >
> > We have a 23-node Rocks-5.1 HPC cluster. The Sun storage of
> > capacity 12 TB is connected to a management server Sun Fire X4150
> > installed with RHEL 5.3 and this server is connected to a Gigabit
> > switch which provides the cluster private network. The home directories on
> > the cluster are NFS mounted from storage partitions across all nodes
> > including the master.
> >
> > This server hangs repeatedly. As an initial troubleshooting step
> > we installed Ganglia, to check network utilization. But it's normal.
> > We're not sure how to troubleshoot it and resolve the problem. Can
> > anybody help us resolve this issue?
> Is there anything amiss according to the service processor?
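[For anyone in a similar spot: if the service processor speaks IPMI (Sun's ILOM does), a couple of generic in-band queries can answer Skylar's question. A sketch only; exact invocations vary by platform:]

   ipmitool sel list    # hardware event log: ECC errors, overtemps, PSU faults, resets
   ipmitool sensor      # current readings and thresholds for all sensors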
> >
> --
> -- Skylar Thompson (skylar at cs.earlham.edu)
> -- http://www.cs.earlham.edu/~skylar/
>
>

From reuti at staff.uni-marburg.de Mon Jan 18 03:30:03 2010
From: reuti at staff.uni-marburg.de (Reuti)
Date: Mon, 18 Jan 2010 12:30:03 +0100
Subject: [Beowulf] Need some advise: Sun storage' management server hangs repeatedly
In-Reply-To:
References: <4B4EA6DB.8070207@cs.earlham.edu>
Message-ID: <05BCD392-37D0-422D-8BB3-8E6BCE1497ED@staff.uni-marburg.de>

Hi,

Am 18.01.2010 um 10:13 schrieb Sangamesh B:

> Hello all,
>
> Thanks for your suggestions.
> But we lost access to the cluster because of the delay.

but the access to the service processor should still be there, and I think Skylar referred to the ILOM interface.

-- Reuti

> But I got useful information to debug next time.
>
> Thanks,
> Sangamesh
> On Thu, Jan 14, 2010 at 10:38 AM, Skylar Thompson
> wrote:
> Sangamesh B wrote:
> > Hi HPC experts,
> >
> > I seek your advice/suggestions to resolve a storage (NAS) server's
> > repeated hanging problem.
> >
> > We have a 23-node Rocks-5.1 HPC cluster. The Sun storage of
> > capacity 12 TB is connected to a management server Sun Fire X4150
> > installed with RHEL 5.3 and this server is connected to a Gigabit
> > switch which provides the cluster private network. The home
> directories on
> > the cluster are NFS mounted from storage partitions across all nodes
> > including the master.
> >
> > This server hangs repeatedly. As an initial troubleshooting step
> > we installed Ganglia, to check network utilization. But it's normal.
> > We're not sure how to troubleshoot it and resolve the problem.
> Can
> > anybody help us resolve this issue?
> Is there anything amiss according to the service processor?
>
> --
> -- Skylar Thompson (skylar at cs.earlham.edu)
> -- http://www.cs.earlham.edu/~skylar/
>
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin
> Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf

From madskaddie at gmail.com Mon Jan 18 06:38:13 2010
From: madskaddie at gmail.com (madskaddie at gmail.com)
Date: Mon, 18 Jan 2010 14:38:13 +0000
Subject: [Beowulf] Gridengine and bash + Modules
In-Reply-To: <1263676497.14044.15.camel@voltaire.rc.usf.edu>
References: <1263676497.14044.15.camel@voltaire.rc.usf.edu>
Message-ID:

2010/1/16 Brian Smith :
> I'm using this in our environment. I've simply added the Modules
> environment code to /etc/bashrc and /etc/csh.cshrc on all nodes (I use
> puppet to manage everything, so this is easy). This ensures that
> Modules is properly integrated with your environment regardless of
> whether you are using an interactive or non-interactive invocation of
> these shells. This works for SGE (I'm on 6.2u4, ATM)
>

But it seems that gridengine spawns like "bash script_name" so no rc files are read. Reading bash manpage, I found the BASH_ENV environment variable:

"""
When bash is started non-interactively, to run a shell script, for example, it looks for the variable BASH_ENV in the environment, expands its value if it appears there, and uses the expanded value as the name of a file to read and execute. Bash behaves as if the following command were executed:
    if [ -n "$BASH_ENV" ]; then . "$BASH_ENV"; fi
but the value of the PATH variable is not used to search for the file name.
""" (bash manpage) Right now I'm setting this variable and with the "-V" job submission flag it's working well (it does not work correctly without it) Gil -- " It can't continue forever. The nature of exponentials is that you push them out and eventually disaster happens. " Gordon Moore (Intel co-founder and author of the Moore's law) From madskaddie at gmail.com Mon Jan 18 11:53:39 2010 From: madskaddie at gmail.com (madskaddie at gmail.com) Date: Mon, 18 Jan 2010 19:53:39 +0000 Subject: [Beowulf] Gridengine and bash + Modules In-Reply-To: <1263836258.10961.21.camel@voltaire.rc.usf.edu> References: <1263676497.14044.15.camel@voltaire.rc.usf.edu> <1263836258.10961.21.camel@voltaire.rc.usf.edu> Message-ID: 2010/1/18 Brian Smith : > Ah, the RedHat-isms that we take for granted... hah! ?I forgot that the > default ~/.bashrc I push out to everyone sources /etc/bashrc by default. > What distro are you using? > Debian lenny > There's also this bit of goodness from the man page: > > "Bash attempts to determine when it is being run with its standard input > connected ?to a a network connection, as if by the remote shell daemon, > usually rshd, or the secure shell daemon sshd. > The Debian bash man page doesn't say the word "sshd" (only "rshd"), and I'm using ssh as the remote shell, so it may be the case (weird, but possible). (...) > > I wonder if sge_shepherd doesn't, in fact, trick shells into behaving > this way... I know I'm not using BASH_ENV and my modules environment > works correctly. > Just to be sure we aren't missing something: you can load a module inside the submit job, correct? Case 1: - module load something - qsub job.sh - cat job.sh #!/bin/bash #(sge config stuff) mpirun ... #EOF Case 2 (what I pretend): - qsub job.sh - cat job.sh #!/bin/bash #(sge config stuff) module add something mpirun ... #EOF > > -Brian > > -- > Brian Smith > Senior Systems Administrator > IT Research Computing, University of South Florida > 4202 E. Fowler Ave. ENB308 > Office Phone: +1 813 974-1467 > Organization URL: http://rc.usf.edu > > > On Mon, 2010-01-18 at 14:38 +0000, madskaddie at gmail.com wrote: >> 2010/1/16 Brian Smith : >> > I'm using this in our environment. ?I've simply added the Modules >> > environment code to /etc/bashrc and /etc/csh.cshrc on all nodes (I use >> > puppet to manage everything, so this is easy). ?This ensures that >> > Modules is properly integrated with your environment regardless of >> > whether you are using an interactive or non-interactive invocation of >> > these shells. ?This works for SGE (I'm on 6.2u4, ATM) >> > >> >> But it seems that gridengine spawns like "bash script_name" so no rc >> files are read. Reading bash manpage, I found the BASH_ENV environment >> variable: >> >> """ >> When ?bash ?is ?started non-interactively, to run a shell script, for >> example, it looks for the variable BASH_ENV in the environment, >> expands its value if it appears there, and uses the expanded value as >> the name of a file to read and execute. ?Bash behaves as if the >> following command were executed: >> ? ? ? ? ? ? ? if [ -n "$BASH_ENV" ]; then . "$BASH_ENV"; fi >> but the value of the PATH variable is not used to search for the file name. >> """ >> (bash manpage) >> >> Right now I'm setting this variable and with the "-V" job submission >> flag it's working well (it does not work correctly without it) >> >> Gil >> >> > > -- " It can't continue forever. The nature of exponentials is that you push them out and eventually disaster happens. 
" Gordon Moore (Intel co-founder and author of the Moore's law) From reuti at staff.uni-marburg.de Mon Jan 18 13:03:32 2010 From: reuti at staff.uni-marburg.de (Reuti) Date: Mon, 18 Jan 2010 22:03:32 +0100 Subject: [Beowulf] Gridengine and bash + Modules In-Reply-To: References: <1263676497.14044.15.camel@voltaire.rc.usf.edu> <1263836258.10961.21.camel@voltaire.rc.usf.edu> Message-ID: <8D2C769A-A86A-4B7D-8787-59228B482070@staff.uni-marburg.de> Hi, Am 18.01.2010 um 20:53 schrieb madskaddie at gmail.com: > 2010/1/18 Brian Smith : >> Ah, the RedHat-isms that we take for granted... hah! I forgot >> that the >> default ~/.bashrc I push out to everyone sources /etc/bashrc by >> default. >> What distro are you using? >> > > Debian lenny > >> There's also this bit of goodness from the man page: >> >> "Bash attempts to determine when it is being run with its standard >> input >> connected to a a network connection, as if by the remote shell >> daemon, >> usually rshd, or the secure shell daemon sshd. >> > > The Debian bash man page doesn't say the word "sshd" (only "rshd"), > and I'm using ssh as the remote shell, so it may be the case (weird, > but possible). > > (...) a) you could define a starter method in SGE's queue setup to define the necessary things. But what functions are these in detail? I checked on one cluster and to me it looks like it's only one: module () { eval `/mypath/environment-modules/3.2.6//Modules/$MODULE_VERSION/ bin/modulecmd bash $*` } and the eval isn't necessary IMO. It would be necessary if $* would include something which has to be interpreted again. So an alias is also working for me (unset module and define it like below): b) alias module='/cm/local/apps/environment-modules/3.2.6//Modules/ $MODULE_VERSION/bin/modulecmd bash $*' or c) you could define a wrapper with a script and put it in /user/ local/bin or alike. -- Reuti >> >> I wonder if sge_shepherd doesn't, in fact, trick shells into beha >> ving >> this way... I know I'm not using BASH_ENV and my modules environment >> works correctly. >> > > Just to be sure we aren't missing something: you can load a module > inside the submit job, correct? > > Case 1: > > - module load something > - qsub job.sh > - cat job.sh > #!/bin/bash > #(sge config stuff) > > mpirun ... > > #EOF > > Case 2 (what I pretend): > - qsub job.sh > - cat job.sh > #!/bin/bash > #(sge config stuff) > > module add something > mpirun ... > > #EOF > > > > >> >> -Brian >> >> -- >> Brian Smith >> Senior Systems Administrator >> IT Research Computing, University of South Florida >> 4202 E. Fowler Ave. ENB308 >> Office Phone: +1 813 974-1467 >> Organization URL: http://rc.usf.edu >> >> >> On Mon, 2010-01-18 at 14:38 +0000, madskaddie at gmail.com wrote: >>> 2010/1/16 Brian Smith : >>>> I'm using this in our environment. I've simply added the Modules >>>> environment code to /etc/bashrc and /etc/csh.cshrc on all nodes >>>> (I use >>>> puppet to manage everything, so this is easy). This ensures that >>>> Modules is properly integrated with your environment regardless of >>>> whether you are using an interactive or non-interactive >>>> invocation of >>>> these shells. This works for SGE (I'm on 6.2u4, ATM) >>>> >>> >>> But it seems that gridengine spawns like "bash script_name" so no rc >>> files are read. 
Reading bash manpage, I found the BASH_ENV >>> environment >>> variable: >>> >>> """ >>> When bash is started non-interactively, to run a shell script, >>> for >>> example, it looks for the variable BASH_ENV in the environment, >>> expands its value if it appears there, and uses the expanded >>> value as >>> the name of a file to read and execute. Bash behaves as if the >>> following command were executed: >>> if [ -n "$BASH_ENV" ]; then . "$BASH_ENV"; fi >>> but the value of the PATH variable is not used to search for the >>> file name. >>> """ >>> (bash manpage) >>> >>> Right now I'm setting this variable and with the "-V" job submission >>> flag it's working well (it does not work correctly without it) >>> >>> Gil >>> >>> >> >> > > > > -- > " > It can't continue forever. The nature of exponentials is that you push > them out and eventually disaster happens. > " > Gordon Moore (Intel co-founder and author of the Moore's law) > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From rpnabar at gmail.com Mon Jan 18 23:26:06 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 19 Jan 2010 01:26:06 -0600 Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping In-Reply-To: <4B554350.5070506@gmail.com> References: <4B53D74B.9000301@scalableinformatics.com> <4B53DC5F.3040601@scalableinformatics.com> <4B554350.5070506@gmail.com> Message-ID: On Mon, Jan 18, 2010 at 11:29 PM, Richard Chang wrote: > > How do we differentiate between the software and hardware RAID > implementations. ANY visual difference?, are they identifiable?. That's a crucial question for me too. I am using the Dell PERC-6-e cards. Any easy ways of telling the FakeRAID's apart? -- Rahul From landman at scalableinformatics.com Tue Jan 19 06:21:10 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 19 Jan 2010 09:21:10 -0500 Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping In-Reply-To: <4B554350.5070506@gmail.com> References: <4B53D74B.9000301@scalableinformatics.com> <4B53DC5F.3040601@scalableinformatics.com> <4B554350.5070506@gmail.com> Message-ID: <4B55BFD6.2040404@scalableinformatics.com> Richard Chang wrote: > Joe Landman wrote: >> >> Its a software RAID implementation pretending to be a hardware RAID >> implementation. They are rarely if ever as good as MD. Many of them >> in Linux will invoke dm (the "other" RAID engine) as dm has "support" >> for fake-raid. Note that we have lost data (multiple times) with >> dm+fake-raid in testing, so we don't recommend its use in important >> machines (ones which you can't afford to lose). This could be due to >> bad drivers for the chips in question, but we aren't taking chances. >> >> > Hello Joe, > I would like to know specifically what models of LSI boxes are software > RAID implementation pretending to be a hardware RAID implementation. I Hi Richard: This I cannot tell you, as I don't have a comprehensive list of what uses what driver. I'd suggest looking at what drivers it loads for disks when it comes up. If dmraid comes up *and* enumerates devices, you have a strong probability that it is a fake-raid. This is not to say dmraid is bad. Again, its the underlying driver or chipset that we often run into problems with. > have a few LSI boxes where I work, and your post made me think if they > really are Hardware RAID implementation. 
> > I have the LSI 2822(old), LSI 4900 & also LSI 7900 controllers based > storage. > > How do we differentiate between the software and hardware RAID > implementations. ANY visual difference?, are they identifiable?. Rarely. Fake raid will generally not have any RAM cache or battery backup capability. In some instances, fake raid is *ok* for OS drives (RAID1 only), if the bios is smart enough to use it correctly, the underlying fake raid driver is relatively stable, and you have reasonable disks. Otherwise, mdadm works great, though you have to patch Redhat/Centos, as they, by default, use dmraid for the moment. Later model Fedora appear to have switched to MD raid (after 9 from what I saw, last time I played with it). > > Richard. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From landman at scalableinformatics.com Tue Jan 19 06:43:11 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 19 Jan 2010 09:43:11 -0500 Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping In-Reply-To: <4B55BFD6.2040404@scalableinformatics.com> References: <4B53D74B.9000301@scalableinformatics.com> <4B53DC5F.3040601@scalableinformatics.com> <4B554350.5070506@gmail.com> <4B55BFD6.2040404@scalableinformatics.com> Message-ID: <4B55C4FF.1050108@scalableinformatics.com> Joe Landman wrote: > This I cannot tell you, as I don't have a comprehensive list of what > uses what driver. I'd suggest looking at what drivers it loads for > disks when it comes up. If dmraid comes up *and* enumerates devices, > you have a strong probability that it is a fake-raid. This is not to > say dmraid is bad. Again, its the underlying driver or chipset that we > often run into problems with. I should also point out that the presence of dmraid and device enumeration is still not sufficient for determining whether something is or is not a fake-raid. To wit root at crunch:~# df -h /data Filesystem Size Used Avail Use% Mounted on /dev/mapper/2001b4d2306a71820-part1 4.6T 2.4T 2.2T 53% /data and /data is on a most assuredly a hardware accelerated RAID. There were some additional tools I installed with the distribution which also installed device-mapper. I could turn it off, but then I have some other bits to work around. It's easier to leave it on. FWIW: I am no fan of device mapper (the dm part of dmraid). It has caused us some serious grief in the past (serious grief == data lossage). Its not in a league with things like rieserfs, ext2, NTFS, and whatnot ... When device mapper works correctly (as above) it works fine. A tautology/Yogi-Berra-ism for sure. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From robh at dongle.org.uk Tue Jan 19 07:50:36 2010 From: robh at dongle.org.uk (Robert Horton) Date: Tue, 19 Jan 2010 15:50:36 +0000 Subject: [Beowulf] Filesystem benchmarks Message-ID: <1263916236.6962.72.camel@moelwyn.maths.qmul.ac.uk> Hi, I'm trying to run some benchmarks on a file server to test the effect of different filesystems, hardware vs software raid, putting the journal on a separate device, etc. 
The machine has 24G of RAM so I'm running /opt/iozone/bin/iozone -c -C -g 48g -n 48g -i 0 -i 1 -i 2 -q 1m -y 4k -a The trouble with this is that it's taking an absolute age - the first random read has been going for about 6 hrs so far. I'm therefore wondering about using either -e (to include a flush) or -I (to bypass the cache) with smaller file sizes. Will this still give a useful comparison? Presumably with -e the read data is still potentially read from cache? Any thoughts? Should I be trying a different approach altogether? Thanks, Rob From jlforrest at berkeley.edu Tue Jan 19 09:12:27 2010 From: jlforrest at berkeley.edu (Jon Forrest) Date: Tue, 19 Jan 2010 09:12:27 -0800 Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping In-Reply-To: <4B55BFD6.2040404@scalableinformatics.com> References: <4B53D74B.9000301@scalableinformatics.com> <4B53DC5F.3040601@scalableinformatics.com> <4B554350.5070506@gmail.com> <4B55BFD6.2040404@scalableinformatics.com> Message-ID: <4B55E7FB.7080305@berkeley.edu> On 1/19/2010 6:21 AM, Joe Landman wrote: > Rarely. Fake raid will generally not have any RAM cache or battery > backup capability. Not only that, but it won't have any hardware to do parity calculations. (It might be hard to recognize such hardware). > In some instances, fake raid is *ok* for OS drives (RAID1 only), if the > bios is smart enough to use it correctly, the underlying fake raid > driver is relatively stable, and you have reasonable disks. It doesn't take much extra work to do RAID0 or RAID1 so whether this is done by a fake raid driver or the md raid driver probably isn't significant from the resource usage point of view. The only advantage I can think of for fake raid is that there's usually a BIOS of sorts in the fake raid card that lets you manipulate the raid units. This might be more convenient than having to boot Linux and mess with mdadm commands. For RAID levels that require parity calculations, then having a hardware RAID card is a win because the card does a lot of work and hides both the parity calculations and required IOs from the host system. On the third hand, if you have a system with lots of CPU and I/O capacity that wouldn't otherwise get used, then it could be argued that a hardware RAID card is an unnecessary expense. In the old days it was easier to decide to go with hardware RAID. These days it's best to do test with both hardware and software RAID, and then see if the measured improvements of hardware RAID (if any) justify its expense. Of course, in any production system you'll want a few extra RAID cards lying around just in case. Cordially, -- Jon Forrest Research Computing Support College of Chemistry 173 Tan Hall University of California Berkeley Berkeley, CA 94720-1460 510-643-1032 jlforrest at berkeley.edu From a.travis at abdn.ac.uk Tue Jan 19 10:22:45 2010 From: a.travis at abdn.ac.uk (Tony Travis) Date: Tue, 19 Jan 2010 12:22:45 -0600 Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping In-Reply-To: <4B55E7FB.7080305@berkeley.edu> References: <4B53D74B.9000301@scalableinformatics.com> <4B53DC5F.3040601@scalableinformatics.com> <4B554350.5070506@gmail.com> <4B55BFD6.2040404@scalableinformatics.com> <4B55E7FB.7080305@berkeley.edu> Message-ID: <4B55F875.70501@abdn.ac.uk> Jon Forrest wrote: > [...] > The only advantage I can think of for fake raid is > that there's usually a BIOS of sorts in the fake > raid card that lets you manipulate the raid units. 
> This might be more convenient than having to boot > Linux and mess with mdadm commands. Hello, Jon and Joe. I use Adaptec 'host' RAID controllers as a way of adding SATA ports to old motherboards that don't have any on-board, configure them in SATA (i.e. non-RAID) mode and build "md" software RAID's using the ports. > For RAID levels that require parity calculations, then > having a hardware RAID card is a win because the card > does a lot of work and hides both the parity calculations > and required IOs from the host system. On the third hand, > if you have a system with lots of CPU and I/O capacity > that wouldn't otherwise get used, then it could be argued > that a hardware RAID card is an unnecessary expense. It has been argued before that, these days, "md" software RAID often performs better because the 'host' CPU is considerably more powerful than the embedded processor on a 'hardware' RAID controller. However, one point that is often overlooked, and the reason I chose a hybrid approach is that AFAIK "md" RAID's do not support hot-swap. I would be very interested to know if anyone is using hot-swap "md" RAID's in production servers: I do realise that development work is going on. > In the old days it was easier to decide to go with > hardware RAID. These days it's best to do test with > both hardware and software RAID, and then see if > the measured improvements of hardware RAID (if any) > justify its expense. Of course, in any production system > you'll want a few extra RAID cards lying around just > in case. Yes, I agree with that! A great virtue of "md" RAID's is that they are independant of the underlying disk controller, and you can easily replace broken controllers or motherboards. If you don't have a spare RAID controller supporting the proprietary format your shiny 'hardware' RAID is using then you can't access your data :-( Bye, Tony. -- Dr. A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk mailto:a.travis at abdn.ac.uk, http://bioinformatics.rri.sari.ac.uk/~ajt From landman at scalableinformatics.com Tue Jan 19 10:43:30 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 19 Jan 2010 13:43:30 -0500 Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping In-Reply-To: <4B55F875.70501@abdn.ac.uk> References: <4B53D74B.9000301@scalableinformatics.com> <4B53DC5F.3040601@scalableinformatics.com> <4B554350.5070506@gmail.com> <4B55BFD6.2040404@scalableinformatics.com> <4B55E7FB.7080305@berkeley.edu> <4B55F875.70501@abdn.ac.uk> Message-ID: <4B55FD52.206@scalableinformatics.com> Tony Travis wrote: > It has been argued before that, these days, "md" software RAID often > performs better because the 'host' CPU is considerably more powerful > than the embedded processor on a 'hardware' RAID controller. However, > one point that is often overlooked, and the reason I chose a hybrid > approach is that AFAIK "md" RAID's do not support hot-swap. I would be > very interested to know if anyone is using hot-swap "md" RAID's in > production servers: I do realise that development work is going on. Not entirely correct. SATA where the hot swap (bring device in/out) logic is. And it does (at least in modern kernels) support physical removal/addition of devices. The MD system itself is event driven. You can "automate" device removal/insertion into a unit, and rebuild the RAID as needed ... to a degree. 
The issue we run into is that occasionally, we have to force a bus scan on the scsi buses to see new SATA drives. Once that is done, some of our other tools automate the incorporation of the new disk within the RAID. > >> In the old days it was easier to decide to go with >> hardware RAID. These days it's best to do test with >> both hardware and software RAID, and then see if >> the measured improvements of hardware RAID (if any) >> justify its expense. Of course, in any production system >> you'll want a few extra RAID cards lying around just >> in case. > > Yes, I agree with that! > > A great virtue of "md" RAID's is that they are independant of the > underlying disk controller, and you can easily replace broken > controllers or motherboards. If you don't have a spare RAID controller > supporting the proprietary format your shiny 'hardware' RAID is using > then you can't access your data :-( In the many RAID cases we have dealt with over the years, we haven't run into this as an issue. That is, while touted as a real tangible benefit of MD RAID, it is of dubious real value in most of the cases we have encountered. Really the benefit is that of being against the change of business conditions for your RAID vendor. If you plan on keeping the same array active until it dies (4-10 years), this could be a consideration. However, you also have to worry about disk availability/compatibility, etc. That is, its not *just* a RAID card issue, its a full stack issue. MD allows you to reduce the risk in various portions of this stack. > > Bye, > > Tony. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From a.travis at abdn.ac.uk Tue Jan 19 13:30:22 2010 From: a.travis at abdn.ac.uk (Tony Travis) Date: Tue, 19 Jan 2010 15:30:22 -0600 Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping In-Reply-To: <4B55FD52.206@scalableinformatics.com> References: <4B53D74B.9000301@scalableinformatics.com> <4B53DC5F.3040601@scalableinformatics.com> <4B554350.5070506@gmail.com> <4B55BFD6.2040404@scalableinformatics.com> <4B55E7FB.7080305@berkeley.edu> <4B55F875.70501@abdn.ac.uk> <4B55FD52.206@scalableinformatics.com> Message-ID: <4B56246E.2050505@abdn.ac.uk> Joe Landman wrote: > [...] > Not entirely correct. SATA where the hot swap (bring device in/out) > logic is. And it does (at least in modern kernels) support physical > removal/addition of devices. The MD system itself is event driven. You > can "automate" device removal/insertion into a unit, and rebuild the > RAID as needed ... to a degree. The issue we run into is that > occasionally, we have to force a bus scan on the scsi buses to see new > SATA drives. Once that is done, some of our other tools automate the > incorporation of the new disk within the RAID. Hello, Joe. The "sdhci" driver in the 2.6 kernel does not notify the kernel of a device change, neither does it flush the kernel buffers. 
Hot-swapping drives using the standard SATA driver is a great way to corrupt your disks, all it does on a SATA disconnect is try connecting again under the assumption that the same drive is attached but the data rate is too high for the cable - I have practical experience of this problem ;-) I started off my quest to build a COTS RAID5 believing what you just said to be true, but I think there is a popular misconception about SATA: It's true that most modern SATA controllers do support hot-swap electrically, but SATA device drivers to my, albeit limited, knowledge do not notify the kernel that a device has been removed or added. The 3ware 'twe' 'hardware' RAID driver does, in response to events from the RAID controller firmware that is monitoring the physical drives. I've looked at the SATA driver sources quite carefully because I do want to use hot-swap with "md" if that is a *safe* and reliable thing to do. However, I am not confident that it is (yet!). Please correct me if I am wrong, because it would be very useful to be able to *reliably* hot-swap SATA drives on an "md" RAID. I bought a lot of 3ware 8006-2's because I don't trust "md" hot-swapping. The 8006-2 is well supported under Linux. >[...] > In the many RAID cases we have dealt with over the years, we haven't run > into this as an issue. That is, while touted as a real tangible benefit > of MD RAID, it is of dubious real value in most of the cases we have > encountered. I've dealt with quite a few cases myself, where we have upgraded motherboards (esp. Tyan) with completely different on-board RAID, with hit and miss support under Linux. Typically, I've replaced an old or faulty motherboard and left everything else as it was. It's because I was using "md" RAID's that this worked. Now I have a great big pile of 3ware 8006-2's just in case, but I also use the on-board RAID controllers in SATA/AHCI mode to construct "md" RAID's. I responded to Rahul who started this thread because his requirements seemed to be similar to mine: i.e. a small-scale DIY Beowulf cluster. In this context, every penny counts and we do not throw things away until they are actually dead: Old servers become new compute nodes, and so on. I think that lot of people reading this list are interested in running small Beowulf clusters for relatively small projects, like me. I've found the Beowulf list to be a mine of useful information, but we are not all running huge Beowulf clusters or supporting them commerically. > Really the benefit is that of being against the change of business > conditions for your RAID vendor. If you plan on keeping the same array > active until it dies (4-10 years), this could be a consideration. > However, you also have to worry about disk availability/compatibility, > etc. That is, its not *just* a RAID card issue, its a full stack issue. I agree, and I've been bitten by that for using 'enterprise' grade disks that are no longer available and ended up replacing faulty 250GB drives with 500GB drives just so I could rebuild the RAID after a disk failure. I've just repeated the trick replacing 500GB drives with 1TB. It's OK if the replacement drive is bigger, and you're using LBA so drive geometry doesn't matter. > MD allows you to reduce the risk in various portions of this stack. Indeed it does, but I think it would be better with reliable hot-swap! Bye, Tony. -- Dr. 
A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk mailto:a.travis at abdn.ac.uk, http://bioinformatics.rri.sari.ac.uk/~ajt From rpnabar at gmail.com Tue Jan 19 15:06:27 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 19 Jan 2010 17:06:27 -0600 Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping In-Reply-To: <4B56246E.2050505@abdn.ac.uk> References: <4B53D74B.9000301@scalableinformatics.com> <4B53DC5F.3040601@scalableinformatics.com> <4B554350.5070506@gmail.com> <4B55BFD6.2040404@scalableinformatics.com> <4B55E7FB.7080305@berkeley.edu> <4B55F875.70501@abdn.ac.uk> <4B55FD52.206@scalableinformatics.com> <4B56246E.2050505@abdn.ac.uk> Message-ID: On Tue, Jan 19, 2010 at 3:30 PM, Tony Travis wrote: > I responded to Rahul who started this thread because his requirements seemed > to be similar to mine: i.e. a small-scale DIY Beowulf cluster. In this > context, every penny counts and we do not throw things away until they are > actually dead: Old servers become new compute nodes, and so on. I think that > lot of people reading this list are interested in running small Beowulf > clusters for relatively small projects, like me. I've found the Beowulf list > to be a mine of useful information, but we are not all running huge Beowulf > clusters or supporting them commerically. I don't know about the others on the list, but you describe my situation pretty accurately Tony! :) Small budget, primitive hardware that's rarely retired etc. Sounds familiar. -- Rahul From jac67 at georgetown.edu Tue Jan 19 15:26:30 2010 From: jac67 at georgetown.edu (Jess Cannata) Date: Tue, 19 Jan 2010 18:26:30 -0500 Subject: [Beowulf] Parallel file systems In-Reply-To: References: Message-ID: <4B563FA6.4000204@georgetown.edu> On 01/13/2010 06:40 AM, tegner at renget.se wrote: > While starting to investigating different storage solutions I came across > gluster (www.gluster.com). I did a search on beowulf.org and came up with > nothing. gpfs, pvfs and lustre on the other resulted in lots of hits. > > Anyone with experience of gluster in HPC? > > Yes, we've been using Glusterfs on one of our lightly used Infiniband clusters (32-nodes, 256 cores). We have found it to be pretty easy to configure and we have liked its performance. If you want more information, you should e-mail Joe Landman, who is also on the list. He's used it in several large setups. > Regards, > > /jon > > From landman at scalableinformatics.com Tue Jan 19 15:51:16 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 19 Jan 2010 18:51:16 -0500 Subject: [Beowulf] Parallel file systems In-Reply-To: <4B563FA6.4000204@georgetown.edu> References: <4B563FA6.4000204@georgetown.edu> Message-ID: <4B564574.9010001@scalableinformatics.com> Jess Cannata wrote: > > > On 01/13/2010 06:40 AM, tegner at renget.se wrote: >> While starting to investigating different storage solutions I came across >> gluster (www.gluster.com). I did a search on beowulf.org and came up with >> nothing. gpfs, pvfs and lustre on the other resulted in lots of hits. >> >> Anyone with experience of gluster in HPC? >> >> > Yes, we've been using Glusterfs on one of our lightly used Infiniband > clusters (32-nodes, 256 cores). We have found it to be pretty easy to > configure and we have liked its performance. If you want more > information, you should e-mail Joe Landman, who is also on the list. 
> He's used it in several large setups. How did I not see this ... mea culpa Yes, we are using GlusterFS in multiple sites with multiple users. Getting excellent performance out of it, as long as the IB can keep up. Long story ask me over beer some day ... We are generating multiple quotes/RFP responses with it (one is going out literally right now). Bug me offline if you'd like. Joe >> Regards, >> >> /jon >> >> > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From brs at usf.edu Sat Jan 16 13:14:57 2010 From: brs at usf.edu (Brian Smith) Date: Sat, 16 Jan 2010 16:14:57 -0500 Subject: [Beowulf] Gridengine and bash + Modules In-Reply-To: References: Message-ID: <1263676497.14044.15.camel@voltaire.rc.usf.edu> I'm using this in our environment. I've simply added the Modules environment code to /etc/bashrc and /etc/csh.cshrc on all nodes (I use puppet to manage everything, so this is easy). This ensures that Modules is properly integrated with your environment regardless of whether you are using an interactive or non-interactive invocation of these shells. This works for SGE (I'm on 6.2u4, ATM) Users can then do the following: 1. Include 'module add' directives in their job scripts for application execution (preferred method) 2. Use persistent module add directives w/ 'module initadd' to ensure their jobs have the correct environment settings (good for interactive jobs via qrsh, but this is better solved in other ways). Here's what I added: ## for bash if [ -d "/opt/admin/Modules" ]; then MODULE_VERSION=3.2.6 MODULE_ROOT=/opt/admin/Modules/$MODULE_VERSION case "$0" in -sh|sh|*/sh) modules_shell=sh ;; -ksh|ksh|*/ksh) modules_shell=ksh ;; -zsh|zsh|*/zsh) modules_shell=zsh ;; -bash|bash|*/bash) modules_shell=bash ;; *) modules_shell=bash ;; esac MODULEPATH=$MODULE_ROOT/modulefiles:$HOME/.modulefiles export WORK SCRATCH MODULEPATH MODULE_ROOT MODULE_VERSION module() { eval `$MODULE_ROOT/bin/modulecmd $modules_shell $*`; } if [ -f $HOME/.modules ]; then eval `egrep '^module(.*load|.*add).*$' $HOME/.modules | head -1` fi fi ## For cshell if ( -d /opt/admin/Modules ) then setenv MODULESHOME /opt/admin/Modules/${MODULE_VERSION} if (! $?MODULEPATH ) then setenv MODULEPATH `sed 's/#.*$//' ${MODULESHOME}/init/.modulespath | awk 'NF==1{printf("%s:",$1)}'` endif if (! $?LOADEDMODULES ) then setenv LOADEDMODULES "" endif if ( -f $HOME/.modules ) then eval `egrep '^module(.*add|.*load).*$' $HOME/.modules | head -1` endif endif -- Brian Smith Senior Systems Administrator IT Research Computing, University of South Florida 4202 E. Fowler Ave. ENB308 Office Phone: +1 813 974-1467 Organization URL: http://rc.usf.edu On Sat, 2010-01-16 at 10:44 +0000, madskaddie at gmail.com wrote: > Greetings, > > I'm using gridengine (6.2u4, open source ver.) and I would like to use > the Modules software. Modules uses a shell function that must be > exported (bash: "export -f func_name" in order to set environment > variables), but gridengine has a bug related with bash exported > functions[1]. > > Is anybody using gridengine, bash and modules? How to solve this? 
> Changing shell is not an option ;) > > This issue is also being discussed here[2]. > > Thanks, > > Gil > > [1] - http://gridengine.sunsource.net/issues/show_bug.cgi?id=2173 > [2] - http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&viewType=browseAll&dsMessageId=238562#messagefocus > From brs at usf.edu Mon Jan 18 09:37:38 2010 From: brs at usf.edu (Brian Smith) Date: Mon, 18 Jan 2010 12:37:38 -0500 Subject: [Beowulf] Gridengine and bash + Modules In-Reply-To: References: <1263676497.14044.15.camel@voltaire.rc.usf.edu> Message-ID: <1263836258.10961.21.camel@voltaire.rc.usf.edu> Ah, the RedHat-isms that we take for granted... hah! I forgot that the default ~/.bashrc I push out to everyone sources /etc/bashrc by default. What distro are you using? There's also this bit of goodness from the man page: "Bash attempts to determine when it is being run with its standard input connected to a a network connection, as if by the remote shell daemon, usually rshd, or the secure shell daemon sshd. If bash determines it is being run in this fashion, it reads and executes commands from ~/.bashrc, if that file exists and is readable. It will not do this if invoked as sh. The --norc option may be used to inhibit this behavior, and the --rcfile option may be used to force another file to be read, but rshd does not generally invoke the shell with those options or allow them to be specified." I wonder if sge_shepherd doesn't, in fact, trick shells into behaving this way... I know I'm not using BASH_ENV and my modules environment works correctly. -Brian -- Brian Smith Senior Systems Administrator IT Research Computing, University of South Florida 4202 E. Fowler Ave. ENB308 Office Phone: +1 813 974-1467 Organization URL: http://rc.usf.edu On Mon, 2010-01-18 at 14:38 +0000, madskaddie at gmail.com wrote: > 2010/1/16 Brian Smith : > > I'm using this in our environment. I've simply added the Modules > > environment code to /etc/bashrc and /etc/csh.cshrc on all nodes (I use > > puppet to manage everything, so this is easy). This ensures that > > Modules is properly integrated with your environment regardless of > > whether you are using an interactive or non-interactive invocation of > > these shells. This works for SGE (I'm on 6.2u4, ATM) > > > > But it seems that gridengine spawns like "bash script_name" so no rc > files are read. Reading bash manpage, I found the BASH_ENV environment > variable: > > """ > When bash is started non-interactively, to run a shell script, for > example, it looks for the variable BASH_ENV in the environment, > expands its value if it appears there, and uses the expanded value as > the name of a file to read and execute. Bash behaves as if the > following command were executed: > if [ -n "$BASH_ENV" ]; then . "$BASH_ENV"; fi > but the value of the PATH variable is not used to search for the file name. > """ > (bash manpage) > > Right now I'm setting this variable and with the "-V" job submission > flag it's working well (it does not work correctly without it) > > Gil > > From dmitri.chubarov at gmail.com Mon Jan 18 01:26:00 2010 From: dmitri.chubarov at gmail.com (Dmitri Chubarov) Date: Mon, 18 Jan 2010 15:26:00 +0600 Subject: [Beowulf] HPC/mpi courses In-Reply-To: <0171F3F7-001B-4E43-B413-F3DE2A7F6054@scinet.utoronto.ca> References: <20100117112448.GA1181@wyddfa.dongle.org.uk> <0171F3F7-001B-4E43-B413-F3DE2A7F6054@scinet.utoronto.ca> Message-ID: Jonathan, thanks for the tip regarding the O'Reilly book. 
I did a search for it and found out that it has been made available for download by the author of the second edition, Dr. Charles Severance: http://cnx.org/content/col11136/latest/. The first edition, written by Kevin Dowd, was going out of print in 1996 when Charles Severance took charge of the second edition, and he put it online when the second edition ran out.

On software optimization for HPC we found the SIAM "cheetah" book "Performance Optimization of Numerically Intensive Codes" by Stefan Goedecker and Adolfy Hoisie to be often referred to as the standard reference.

The other two books I found on Safari were "Software Optimization for High-Performance Computing" by Kevin R. Wadleigh and Isom L. Crawford, published in 2000 with an emphasis on linear algebra and signal processing applications, and "The Art of Multiprocessor Programming" by Maurice Herlihy and Nir Shavit, published in 2008, which is really good on data structures and "non-numerical" algorithms.

There are probably many more books published by universities as online courses. Also I know a few undergraduate level textbooks in Russian that are unlikely to be ever translated into English.

Dima

On Mon, Jan 18, 2010 at 5:07 AM, Jonathan Dursi wrote:
>
> On 2010-01-17, at 6:24AM, Rob Horton wrote:
> > On Sat, Jan 16, 2010 at 11:50:48AM +0300, Walid wrote:
> >>
> >> do you know of any official courses run in Europe, or Asia covering
> >> HPC system, or development. mpi or new distributed memory paradigms
> >> are welcome.
> >
> > NAG run various courses on behalf of HECToR in the UK:
> > http://www.hector.ac.uk/cse/training/
>
> We have videos and slides up of a week-long MPI/OpenMP course we teach at SciNet at the University of Toronto:
>
>        http://www.cita.utoronto.ca/~ljdursi/PSP/
>
> Videos online are no substitute for being in the classroom yourself, of course, but it's better than nothing.
>
> Along those lines, does anyone have a good HPC / parallel computing textbook to get users started? There are (say) passable MPI books, or OpenMP, or even on the Intel thread building block stuff, but very little that integrates everything that I can find.
>
> Similarly with performance issues; O'Reilly used to have a pretty solid little book on HPC which was very nice for teaching people to think about serial optimization, but the last edition was 1998 and I can't find anything comparable.
>
>  - Jonathan
>
> --
> Jonathan Dursi
>
>
>
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From brs at usf.edu Mon Jan 18 13:15:10 2010
From: brs at usf.edu (Brian Smith)
Date: Mon, 18 Jan 2010 16:15:10 -0500
Subject: [Beowulf] Gridengine and bash + Modules
In-Reply-To:
References: <1263676497.14044.15.camel@voltaire.rc.usf.edu> <1263836258.10961.21.camel@voltaire.rc.usf.edu>
Message-ID: <1263849310.2531.3.camel@plato>

On Mon, 2010-01-18 at 19:53 +0000, madskaddie at gmail.com wrote:
> 2010/1/18 Brian Smith :
> > Ah, the RedHat-isms that we take for granted... hah! I forgot that the
> > default ~/.bashrc I push out to everyone sources /etc/bashrc by default.
> > What distro are you using?
> > > > Debian lenny > > > There's also this bit of goodness from the man page: > > > > "Bash attempts to determine when it is being run with its standard input > > connected to a a network connection, as if by the remote shell daemon, > > usually rshd, or the secure shell daemon sshd. > > > > The Debian bash man page doesn't say the word "sshd" (only "rshd"), > and I'm using ssh as the remote shell, so it may be the case (weird, > but possible). > > (...) > > > > > I wonder if sge_shepherd doesn't, in fact, trick shells into behaving > > this way... I know I'm not using BASH_ENV and my modules environment > > works correctly. > > > > Just to be sure we aren't missing something: you can load a module > inside the submit job, correct? > > Case 1: > > - module load something > - qsub job.sh > - cat job.sh > #!/bin/bash > #(sge config stuff) > > mpirun ... > > #EOF (Unless you are doing qsub -V) This should be - module initadd something - qsub job.sh ... Remember, you need a "module load null" line in ~/.bashrc or, in my case, ~/.modules. This makes sure the module is loaded when bash starts. > Case 2 (what I pretend): > - qsub job.sh > - cat job.sh > #!/bin/bash > #(sge config stuff) > > module add something > mpirun ... > > #EOF > > > > > > > > -Brian > > > > -- > > Brian Smith > > Senior Systems Administrator > > IT Research Computing, University of South Florida > > 4202 E. Fowler Ave. ENB308 > > Office Phone: +1 813 974-1467 > > Organization URL: http://rc.usf.edu > > > > > > On Mon, 2010-01-18 at 14:38 +0000, madskaddie at gmail.com wrote: > >> 2010/1/16 Brian Smith : > >> > I'm using this in our environment. I've simply added the Modules > >> > environment code to /etc/bashrc and /etc/csh.cshrc on all nodes (I use > >> > puppet to manage everything, so this is easy). This ensures that > >> > Modules is properly integrated with your environment regardless of > >> > whether you are using an interactive or non-interactive invocation of > >> > these shells. This works for SGE (I'm on 6.2u4, ATM) > >> > > >> > >> But it seems that gridengine spawns like "bash script_name" so no rc > >> files are read. Reading bash manpage, I found the BASH_ENV environment > >> variable: > >> > >> """ > >> When bash is started non-interactively, to run a shell script, for > >> example, it looks for the variable BASH_ENV in the environment, > >> expands its value if it appears there, and uses the expanded value as > >> the name of a file to read and execute. Bash behaves as if the > >> following command were executed: > >> if [ -n "$BASH_ENV" ]; then . "$BASH_ENV"; fi > >> but the value of the PATH variable is not used to search for the file name. > >> """ > >> (bash manpage) > >> > >> Right now I'm setting this variable and with the "-V" job submission > >> flag it's working well (it does not work correctly without it) > >> > >> Gil > >> > >> > > > > > > > From rchang.lists at gmail.com Mon Jan 18 21:29:52 2010 From: rchang.lists at gmail.com (Richard Chang) Date: Tue, 19 Jan 2010 10:59:52 +0530 Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping In-Reply-To: <4B53DC5F.3040601@scalableinformatics.com> References: <4B53D74B.9000301@scalableinformatics.com> <4B53DC5F.3040601@scalableinformatics.com> Message-ID: <4B554350.5070506@gmail.com> Joe Landman wrote: > > Its a software RAID implementation pretending to be a hardware RAID > implementation. They are rarely if ever as good as MD. Many of them > in Linux will invoke dm (the "other" RAID engine) as dm has "support" > for fake-raid. 
Note that we have lost data (multiple times) with > dm+fake-raid in testing, so we don't recommend its use in important > machines (ones which you can't afford to lose). This could be due to > bad drivers for the chips in question, but we aren't taking chances. > > Hello Joe, I would like to know specifically which models of LSI boxes are software RAID implementations pretending to be hardware RAID implementations. I have a few LSI boxes where I work, and your post made me wonder whether they really are hardware RAID implementations. I have the LSI 2822 (old), LSI 4900, and LSI 7900 controller-based storage. How do we differentiate between the software and hardware RAID implementations? Any visual difference? Are they identifiable? Richard. From rchang.lists at gmail.com Tue Jan 19 08:44:59 2010 From: rchang.lists at gmail.com (Richard Chang) Date: Tue, 19 Jan 2010 22:14:59 +0530 Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping In-Reply-To: <4B55BFD6.2040404@scalableinformatics.com> References: <4B53D74B.9000301@scalableinformatics.com> <4B53DC5F.3040601@scalableinformatics.com> <4B554350.5070506@gmail.com> <4B55BFD6.2040404@scalableinformatics.com> Message-ID: <4B55E18B.20108@gmail.com> Joe Landman wrote: > > Hi Richard: > > This I cannot tell you, as I don't have a comprehensive list of what > uses what driver. I'd suggest looking at what drivers it loads for > disks when it comes up. If dmraid comes up *and* enumerates devices, > you have a strong probability that it is a fake-raid. This is not to > say dmraid is bad. Again, it's the underlying driver or chipset that > we often run into problems with. Thanks Joe, This is a much better picture. I am sure there is no such thing as "dmraid" coming up when the system I maintain starts. > > > Rarely. Fake raid will generally not have any RAM cache or battery > backup capability. I am also sure all the storage controllers that I have mentioned have a battery and RAM cache. > In some instances, fake raid is *ok* for OS drives (RAID1 only), if > the bios is smart enough to use it correctly, the underlying fake raid > driver is relatively stable, and you have reasonable disks. > Otherwise, mdadm works great, though you have to patch Redhat/Centos, > as they, by default, use dmraid for the moment. Later model Fedora > appear to have switched to MD raid (after 9 from what I saw, last time > I played with it). I really appreciate your effort and the time taken to reply back. Thanks, Richard From sabujp at gmail.com Tue Jan 19 15:56:53 2010 From: sabujp at gmail.com (Sabuj Pattanayek) Date: Tue, 19 Jan 2010 17:56:53 -0600 Subject: [Beowulf] Parallel file systems In-Reply-To: <4B563FA6.4000204@georgetown.edu> References: <4B563FA6.4000204@georgetown.edu> Message-ID: Gluster is the easiest clustered FS to set up vs OCFS2, GFS/GFS2, Lustre, and XSan/Storenext. My gripe with it, though, is that quotas currently work at the filesystem level and not at the gluster level, and this gets messy if you've got a stripe and cluster+distributed setup across the same filesystems on multiple storage bricks. That is, if you fill up your quota on one node in the gluster mounted directory /dist/user (cluster+distributed) and then try to write to /stripe/user (the gluster stripe across all nodes), writes to the stripe will fail. Writes to the distributed directory continue until you run out of your quota on all the nodes.
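To make the two layouts concrete: both behaviours come from client-side translator stacks defined over the same bricks. This is a sketch from memory rather than an actual production volfile - the brick names and block size are made up - but a GlusterFS 2.x-style client config for the two mounts looks roughly like:

volume dist
  type cluster/distribute
  subvolumes brick1 brick2 brick3 brick4 brick5
end-volume

volume stripe
  type cluster/stripe
  option block-size 1MB
  subvolumes brick1 brick2 brick3 brick4 brick5
end-volume

Since both stacks sit on the same backend filesystems, a filesystem-level quota filled through one is filled for the other as well, which is exactly the failure mode described above.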
Ideally one should set up multiple filesystems across the nodes, each for the different types of read/write methods, but this isn't always possible or desirable, especially if you want one "global" filesystem space. Otherwise, performance is great using infiniband, especially to the stripes. Are there other clustered FS out there that use InfiniBand - maybe Lustre? There's also heavy development on the codebase so it's continually evolving and improving. HTH, Sabuj Pattanayek On Tue, Jan 19, 2010 at 5:26 PM, Jess Cannata wrote: > > On 01/13/2010 06:40 AM, tegner at renget.se wrote: >> While starting to investigate different storage solutions I came across >> gluster (www.gluster.com). I did a search on beowulf.org and came up with >> nothing. gpfs, pvfs and lustre on the other hand resulted in lots of hits. From kilian.cavalotti.work at gmail.com Wed Jan 20 23:18:54 2010 From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI) Date: Thu, 21 Jan 2010 08:18:54 +0100 Subject: [Beowulf] Parallel file systems In-Reply-To: References: <4B563FA6.4000204@georgetown.edu> Message-ID: On Wed, Jan 20, 2010 at 12:56 AM, Sabuj Pattanayek wrote: > Otherwise, performance is great using infiniband, especially to the > stripes. Are there other clustered FS out there that use InfiniBand - maybe > Lustre? Yup, plenty of them: Lustre, GPFS, pNFS... Cheers, -- Kilian From d.love at liverpool.ac.uk Thu Jan 21 08:51:45 2010 From: d.love at liverpool.ac.uk (Dave Love) Date: Thu, 21 Jan 2010 16:51:45 +0000 Subject: [Beowulf] Re: Gridengine and bash + Modules In-Reply-To: <1263849310.2531.3.camel@plato> (Brian Smith's message of "Mon, 18 Jan 2010 16:15:10 -0500") References: <1263676497.14044.15.camel@voltaire.rc.usf.edu> <1263836258.10961.21.camel@voltaire.rc.usf.edu> <1263849310.2531.3.camel@plato> Message-ID: <87eiljcsf2.fsf@liv.ac.uk> I saw this late after reuti referred to it on the SGE list where it was originally raised, and I'm not sure I've got the whole thread. Could someone explain what the problem actually is with SGE and modules, if there really is one? I think there isn't. As far as I (and reuti?) can tell, there's no general problem with `qsub -V' with loaded modules (which is probably natural for command-line submission and what I typically do), or with explicitly loading modules in a job script (which isn't affected by communication within SGE anyway). There is a general problem (and corresponding SGE bug report) with exported shell variables or function definitions with multi-line values, but modules doesn't export the function definition, and its body could be trivially re-written as a simple command (eliding `{...;}') if necessary. From hahn at mcmaster.ca Thu Jan 21 14:15:27 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 21 Jan 2010 17:15:27 -0500 (EST) Subject: [Beowulf] Parallel file systems In-Reply-To: References: <4B563FA6.4000204@georgetown.edu> Message-ID: > Otherwise, performance is great using infiniband, especially to the > stripes. I find that people have greatly differing expectations, so what does "great" mean to you? I would expect to achieve ~90% of the peak bandwidth of the interconnect, for instance, assuming enough and fast-enough storage targets. this is easy enough to achieve using Lustre in my experience (albeit I haven't really tried eg QDR IB.)
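A quick way to put numbers behind expectations like these is a crude single-client streaming test against the parallel mount, using O_DIRECT to keep the client page cache out of the measurement. The mount point below is hypothetical, and a real benchmark needs multiple clients and care with file sizes; this is just a sanity check:

# write, then read back, 16 GB on an assumed parallel-FS mount at /mnt/pfs
dd if=/dev/zero of=/mnt/pfs/bwtest bs=1M count=16384 oflag=direct
dd if=/mnt/pfs/bwtest of=/dev/null bs=1M iflag=direct

For DDR InfiniBand's ~16 Gbit/s data rate, 90% of peak works out to roughly 1.8 GB/s.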
From sabujp at gmail.com Thu Jan 21 14:51:58 2010 From: sabujp at gmail.com (Sabuj Pattanayek) Date: Thu, 21 Jan 2010 16:51:58 -0600 Subject: [Beowulf] Parallel file systems In-Reply-To: References: <4B563FA6.4000204@georgetown.edu> Message-ID: On Thu, Jan 21, 2010 at 4:15 PM, Mark Hahn wrote: >> Otherwise, performance is great using infiniband, especially to the >> stripes. > > I find that people have greatly differing expectations, so what does "great" > mean to you? The stripe is across 5 machines and I'm getting about 3 times speedup on writes and 2.5 times speedup on reads vs the same tests going directly to the XFS filesystem on a single node. Not hitting anything close to the peak interconnect bandwidth, about 4gbps on writes and 7gbps on reads. > > I would expect to achieve ~90% of the peak bandwidth of the interconnect, > for instance, assuming enough and fast-enough storage targets. this is easy > enough to achieve using Lustre in my experience (albeit I haven't really > tried eg QDR IB.) What about your setup above? What sort of speedup are you getting vs a single node, assuming your (storage) nodes are homogeneous in terms of hardware? From robl at mcs.anl.gov Fri Jan 22 07:47:00 2010 From: robl at mcs.anl.gov (Rob Latham) Date: Fri, 22 Jan 2010 09:47:00 -0600 Subject: [Beowulf] Parallel file systems In-Reply-To: References: <4B563FA6.4000204@georgetown.edu> Message-ID: <20100122154659.GA3743@mcs.anl.gov> On Tue, Jan 19, 2010 at 05:56:53PM -0600, Sabuj Pattanayek wrote: > Gluster is the easiest clustered FS to set up vs OCFS2, GFS/GFS2, > Lustre, and XSan/Storenext. I haven't set up any gluster systems, but a point of pride for the PVFS project is our ease of deployment. I'd be interested to hear ways that gluster is even easier to deploy than PVFS. > Otherwise, performance is great using infiniband, especially to the > stripes. Are there other clustered FS out there that use InfiniBand - maybe > Lustre? There's also heavy development on the codebase so it's > continually evolving and improving. PVFS has had infiniband support for quite some time. ==rob -- Rob Latham Mathematics and Computer Science Division Argonne National Lab, IL USA From christiansuhendra at gmail.com Thu Jan 21 23:25:48 2010 From: christiansuhendra at gmail.com (christian suhendra) Date: Fri, 22 Jan 2010 16:55:48 +0930 Subject: [Beowulf] mpi_cart_coords : invalid rank Message-ID: hello guys... may i ask your advice? i have a problem here: when i run my program on MPICH i get this error: root at cluster3:/mirror/mpich-1.2.7p1# mpirun -np 1 ./canon Process 0 of 1 on cluster3 Total Time: 4.161119 msecs root at cluster3:/mirror/mpich-1.2.7p1# mpirun -np 2 ./canon Process 1 of 2 on cluster3 Process 0 of 2 on cluster3 0 - MPI_CART_COORDS : Invalid rank [0] Aborting program ! [0] Aborting program! child process exited unexpectedly 0 aborted for information: my RSH and NFS are working.. please help me.. i really need your advice thank you very much regards, christian
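A likely cause, assuming ./canon implements Cannon's algorithm (the name suggests it, but the code isn't shown, so this is a guess): Cannon's algorithm needs a square process grid, so the cartesian communicator is typically created for sqrt(np) x sqrt(np) processes. With -np 2 the grid is 1x1, rank 1 has no coordinates in it, and MPI_Cart_coords aborts with an invalid-rank error. A quick test:

mpirun -np 4 ./canon
mpirun -np 9 ./canon

If perfect-square process counts run while -np 2 fails, the problem is the process count (or the program's dims calculation), not the rsh or NFS setup.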
From sabujp at gmail.com Fri Jan 22 09:27:41 2010 From: sabujp at gmail.com (Sabuj Pattanayek) Date: Fri, 22 Jan 2010 11:27:41 -0600 Subject: [Beowulf] Parallel file systems In-Reply-To: <20100122154659.GA3743@mcs.anl.gov> References: <4B563FA6.4000204@georgetown.edu> <20100122154659.GA3743@mcs.anl.gov> Message-ID: On Fri, Jan 22, 2010 at 9:47 AM, Rob Latham wrote: > On Tue, Jan 19, 2010 at 05:56:53PM -0600, Sabuj Pattanayek wrote: >> Gluster is the easiest clustered FS to set up vs OCFS2, GFS/GFS2, >> Lustre, and XSan/Storenext. > > I haven't set up any gluster systems, but a point of pride for the > PVFS project is our ease of deployment. I'd be interested to hear > ways that gluster is even easier to deploy than PVFS. Haven't set up PVFS, but after reading some articles I get the feeling that it can only write data using a striped method across IODs. Gluster gives you a bit more flexibility and robustness since there's also the distribute write method and mirroring options. From mdidomenico4 at gmail.com Fri Jan 22 10:33:08 2010 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Fri, 22 Jan 2010 13:33:08 -0500 Subject: [Beowulf] rhel page_size Message-ID: does anyone know if it's still possible to change the default page_size from 4k to something larger on RHEL v5 x86_64? My efforts to recompile the kernel with a larger page size are failing me, but i might be doing it wrong... thanks From chekh at pcbi.upenn.edu Fri Jan 22 10:56:28 2010 From: chekh at pcbi.upenn.edu (Alex Chekholko) Date: Fri, 22 Jan 2010 13:56:28 -0500 Subject: [Beowulf] rhel page_size In-Reply-To: References: Message-ID: <20100122135628.7a76e295.chekh@pcbi.upenn.edu> On Fri, 22 Jan 2010 13:33:08 -0500 Michael Di Domenico wrote: > does anyone know if it's still possible to change the default > page_size from 4k to something larger on RHEL v5 x86_64? > > My efforts to recompile the kernel with a larger page size are failing > me, but i might be doing it wrong... Hi Michael, What are you actually trying to accomplish?
> > These links may be relevant: > http://en.wikipedia.org/wiki/Page_(computer_memory)#Huge_pages > http://dank.qemfd.net/dankwiki/index.php/Pages > > One place where this comes up is trying to increase the blocksize of > ext3; my understanding is that it's not possible in practice: > http://en.wikipedia.org/wiki/Ext3#cite_note-7 > > Regards, > -- > Alex Chekholko ? chekh at pcbi.upenn.edu > From landman at scalableinformatics.com Fri Jan 22 11:49:13 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Fri, 22 Jan 2010 14:49:13 -0500 Subject: [Beowulf] rhel page_size In-Reply-To: References: <20100122135628.7a76e295.chekh@pcbi.upenn.edu> Message-ID: <4B5A0139.4050003@scalableinformatics.com> Michael Di Domenico wrote: > We'd like to run an experiment and see if MAGMA will run any > faster/better/different with a larger page size. Larger pages will reduce TLB thrashing/pressure. The best way to tell if you need it is to run performance counter tools during your program run. If you haven't, you might be optimizing something which provides minimal if any benefit. Have you profiled the code (using any of the various tools) to see where it is spending its time? -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From mdidomenico4 at gmail.com Fri Jan 22 12:14:36 2010 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Fri, 22 Jan 2010 15:14:36 -0500 Subject: [Beowulf] rhel page_size In-Reply-To: <4B5A0139.4050003@scalableinformatics.com> References: <20100122135628.7a76e295.chekh@pcbi.upenn.edu> <4B5A0139.4050003@scalableinformatics.com> Message-ID: Nope, haven't gotten that far yet. I seemed to recall (or atleast i thought i had) that i could easily change the page size during kernel compile. perhaps that was way back in the day or i'm just losing it. figured it'd be a quick easy test... guest not... :( On Fri, Jan 22, 2010 at 2:49 PM, Joe Landman wrote: > Michael Di Domenico wrote: >> >> We'd like to run an experiment and see if MAGMA will run any >> faster/better/different with a larger page size. > > Larger pages will reduce TLB thrashing/pressure. ?The best way to tell if > you need it is to run performance counter tools during your program run. ?If > you haven't, you might be optimizing something which provides minimal if any > benefit. > > Have you profiled the code (using any of the various tools) to see where it > is spending its time? > > > -- > Joseph Landman, Ph.D > Founder and CEO > Scalable Informatics Inc. > email: landman at scalableinformatics.com > web ?: http://scalableinformatics.com > ? ? ? http://scalableinformatics.com/jackrabbit > phone: +1 734 786 8423 x121 > fax ?: +1 866 888 3112 > cell : +1 734 612 4615 > From ebiederm at xmission.com Sat Jan 23 05:34:27 2010 From: ebiederm at xmission.com (Eric W. Biederman) Date: Sat, 23 Jan 2010 05:34:27 -0800 Subject: [Beowulf] rhel page_size In-Reply-To: (Michael Di Domenico's message of "Fri\, 22 Jan 2010 15\:14\:36 -0500") References: <20100122135628.7a76e295.chekh@pcbi.upenn.edu> <4B5A0139.4050003@scalableinformatics.com> Message-ID: Michael Di Domenico writes: > Nope, haven't gotten that far yet. I seemed to recall (or atleast i > thought i had) that i could easily change the page size during kernel > compile. perhaps that was way back in the day or i'm just losing it. 
> figured it'd be a quick easy test... guess not... :( Sounds like ia64, not x86_64. On x86_64 the hardware-supported page sizes are 4K, 2M, and 1G. Only 4K makes sense for general use. Eric From robl at mcs.anl.gov Mon Jan 25 07:22:20 2010 From: robl at mcs.anl.gov (Rob Latham) Date: Mon, 25 Jan 2010 09:22:20 -0600 Subject: [Beowulf] Parallel file systems In-Reply-To: References: <4B563FA6.4000204@georgetown.edu> <20100122154659.GA3743@mcs.anl.gov> Message-ID: <20100125152220.GA21173@mcs.anl.gov> On Fri, Jan 22, 2010 at 11:27:41AM -0600, Sabuj Pattanayek wrote: > Haven't set up PVFS, but after reading some articles I get the feeling > that it can only write data using a striped method across IODs. > Gluster gives you a bit more flexibility and robustness since there's > also the distribute write method and mirroring options. There are a couple options with regards to PVFS data distribution. "stripe across all" is one, but it's not difficult to change that to stripe across one or some. Software mirroring, it's true, remains a research effort. In practice, hardware redundancy serves many sites well. ==rob -- Rob Latham Mathematics and Computer Science Division Argonne National Lab, IL USA From eagles051387 at gmail.com Mon Jan 25 07:28:50 2010 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Mon, 25 Jan 2010 16:28:50 +0100 Subject: [Beowulf] clustering using xen virtualized machines Message-ID: has anyone tried clustering using xen based vm's. what is everyone's take on that? its something that popped into my head while in my lectures today. -- Jonathan Aquilina From h-bugge at online.no Mon Jan 25 07:30:28 2010 From: h-bugge at online.no (Håkon Bugge) Date: Mon, 25 Jan 2010 16:30:28 +0100 Subject: [Beowulf] rhel page_size In-Reply-To: References: <20100122135628.7a76e295.chekh@pcbi.upenn.edu> <4B5A0139.4050003@scalableinformatics.com> Message-ID: On Jan 22, 2010, at 21:14 , Michael Di Domenico wrote: > Nope, haven't gotten that far yet. I seemed to recall (or at least i > thought i had) that i could easily change the page size during kernel > compile. perhaps that was way back in the day or i'm just losing it. > figured it'd be a quick easy test... guess not... :( I used libhugetlbfs today with success, followed more or less this recipe: http://www.ibm.com/developerworks/systems/library/es-lop-leveragepages/ Huge mallocs could then take advantage of the huge page support on your CPU (2MB on x86_64) and you do not need any kernel/apps recompilation. Håkon From mathog at caltech.edu Mon Jan 25 10:46:31 2010 From: mathog at caltech.edu (David Mathog) Date: Mon, 25 Jan 2010 10:46:31 -0800 Subject: [Beowulf] Logging MCE information on next warm boot? Message-ID: Is it possible to have the Machine Check Exception (MCE) information saved to disk automatically on the next warm boot? Long form: A K7 node crashed yesterday and left an MCE on the screen which I copied down as: CPU 0 machine check exception 0000000000000007 Bank 1 F000000000000853 Bank 2 940040000000017A at 00000000001511C0 Kernel panic, not syncing, Unable to Continue Copying all of those numbers down is very error prone. As I understand it the MCE values stay in the registers of the CPU after the crash, and may be retrieved at the next warm boot (via a front panel reset, for instance).
But this save seems not to happen automatically, or at least I could not find anything that looked like an MCE dump in /var/log or /var/log/kernel when the system came up. So I want to set things up, if possible, to save this information to disk. For what it's worth, this is on a Tyan S2466, and while on the next warm boot the hardware monitor in the BIOS showed the CPU fan at full speed, when the OS came up lm_sensors showed it at half speed. I have seen this glitch before on other mysterious crashes, and the only way to clear it seems to be to unplug the unit for 10 minutes, allowing time for the errant bit to fade away. This is on a 2.6.24.17 kernel. Thanks, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From Greg at keller.net Mon Jan 25 12:26:35 2010 From: Greg at keller.net (Greg Keller) Date: Mon, 25 Jan 2010 14:26:35 -0600 Subject: [Beowulf] Logging MCE information on next warm boot? In-Reply-To: <201001252000.o0PK08ZT020207@bluewest.scyld.com> References: <201001252000.o0PK08ZT020207@bluewest.scyld.com> Message-ID: > Date: Mon, 25 Jan 2010 10:46:31 -0800 > From: "David Mathog" > Subject: [Beowulf] Logging MCE information on next warm boot? > To: beowulf at beowulf.org > Message-ID: > Content-Type: text/plain; charset=iso-8859-1 > > Is it possible to have the Machine Check Exception (MCE) information > saved to disk automatically on the next warm boot? David, I believe the utility you are looking for is mcelog. We usually run it with the following arguments: /usr/sbin/mcelog -h --ignorenodev --filter I think it clears the info after it reports it, so make sure to tee it to a file. I don't understand the command or the flags, just a copy / paste script kiddy in these regards, but I hope it helps. Cheers! Greg From deadline at eadline.org Mon Jan 25 15:23:59 2010 From: deadline at eadline.org (Douglas Eadline) Date: Mon, 25 Jan 2010 18:23:59 -0500 (EST) Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: References: Message-ID: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org> You may want to look at this: Building A Virtual Cluster with Xen http://www.clustermonkey.net//content/view/139/33/ -- Doug > has anyone tried clustering using xen based vm's. what is everyone's take > on > that? its something that popped into my head while in my lectures today. > > -- > Jonathan Aquilina > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Doug From deadline at eadline.org Mon Jan 25 15:30:15 2010 From: deadline at eadline.org (Douglas Eadline) Date: Mon, 25 Jan 2010 18:30:15 -0500 (EST) Subject: [Beowulf] WhisperingWulf: A Silent Personal Cluster Message-ID: <47005.192.168.1.1.1264462215.squirrel@mail.eadline.org> For those of you who hate fan noise (or have some free time, aluminum, and want to impress your colleagues), have a look at WhisperingWulf: A Silent Personal Cluster http://www.clustermonkey.net//content/view/273/1/ -- Doug From ashley at pittman.co.uk Mon Jan 25 16:12:28 2010 From: ashley at pittman.co.uk (Ashley Pittman) Date: Tue, 26 Jan 2010 00:12:28 +0000 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: References: Message-ID: On 25 Jan 2010, at 15:28, Jonathan Aquilina wrote: > has anyone tried clustering using xen based vm's. what is everyone's take on that?
its something that popped into my head while in my lectures today. I've been using Amazon ec2 for clustering for months now; from a software perspective it's very similar to running real hardware. For my needs (development) it's perfectly adequate, I've not benchmarked it against running the same code on the raw hardware though. Ashley, -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk From ebiederm at xmission.com Mon Jan 25 16:17:07 2010 From: ebiederm at xmission.com (Eric W. Biederman) Date: Mon, 25 Jan 2010 16:17:07 -0800 Subject: [Beowulf] Logging MCE information on next warm boot? In-Reply-To: (Greg Keller's message of "Mon\, 25 Jan 2010 14\:26\:35 -0600") References: <201001252000.o0PK08ZT020207@bluewest.scyld.com> Message-ID: Greg Keller writes: >> Date: Mon, 25 Jan 2010 10:46:31 -0800 >> From: "David Mathog" >> Subject: [Beowulf] Logging MCE information on next warm boot? >> To: beowulf at beowulf.org >> Message-ID: >> Content-Type: text/plain; charset=iso-8859-1 >> >> Is it possible to have the Machine Check Exception (MCE) information >> saved to disk automatically on the next warm boot? > > David, > > I believe the utility you are looking for is mcelog. We usually run it with > the following arguments: > /usr/sbin/mcelog -h --ignorenodev --filter > > I think it clears the info after it reports it, so make sure to tee it to a > file. I don't understand the command or the flags, just a copy / paste script > kiddy in these regards, but I hope it helps. In the case of a panic this won't work. You would need to set up kdump or something like that to capture the panic. This sounds like L1 or L2 cache corruption but I haven't ever had any machine checks on anything before the k8 core. Wow. You are talking about old machines. If machine check registers are kept across reboot there is a reasonable chance that the firmware clears them. Eric From lindahl at pbm.com Mon Jan 25 16:45:15 2010 From: lindahl at pbm.com (Greg Lindahl) Date: Mon, 25 Jan 2010 16:45:15 -0800 Subject: [Beowulf] Logging MCE information on next warm boot? In-Reply-To: References: Message-ID: <20100126004515.GA8936@bx9.net> On Mon, Jan 25, 2010 at 10:46:31AM -0800, David Mathog wrote: > Is it possible to have the Machine Check Exception (MCE) information > saved to disk automatically on the next warm boot? You can use a serial or serial-over-lan console to capture it. You can take a photo of the screen. Be sure to send the magic escapes that turn off screen-blanking: echo -e "\033[9;0]" >/dev/console echo -e "\033[13]" >/dev/console -- greg From eagles051387 at gmail.com Mon Jan 25 21:52:51 2010 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Tue, 26 Jan 2010 06:52:51 +0100 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: References: Message-ID: would something like this be useful for, let's say, setting up a system that works with AI and voice recognition? -- Jonathan Aquilina From henning.fehrmann at aei.mpg.de Mon Jan 25 23:58:40 2010 From: henning.fehrmann at aei.mpg.de (Henning Fehrmann) Date: Tue, 26 Jan 2010 08:58:40 +0100 Subject: [Beowulf] Logging MCE information on next warm boot?
In-Reply-To: References: Message-ID: <20100126075840.GA4952@gretchen.aei.mpg.de> Hi David, On Mon, Jan 25, 2010 at 10:46:31AM -0800, David Mathog wrote: > Is it possible to have the Machine Check Exception (MCE) information > saved to disk automatically on the next warm boot? > > Long form: > > A K7 node crashed yesterday and left an MCE on the screen which I copied > down as: > > CPU 0 machine check exception 0000000000000007 > Bank 1 F000000000000853 > Bank 2 940040000000017A at 00000000001511C0 > Kernel panic, not syncing, Unable to Continue > > Copying all of those numbers down is very error prone. As I understand > it the MCE values stay in the registers of the CPU after the crash, and > may be retrieved at the next warm boot (via a front panel reset, for > instance). But this save seems not to happen automatically, or at least > I could not find anything that looked like an MCE dump in /var/log or > /var/log/kernel when the system came up. So I want to set things up, if > possible to save this information to disk. We loaded the netconsole module. This works at least for the 2.6.27 kernel. AFAIK for older kernels one has to compile it into the kernel. It sends printk messages to a remote syslog-ng server which collects the information. I don't know how much netconsole sends in the case of a panic. netconsole needs parameters: modprobe netconsole netconsole=own_port at own_ip/NIC,remote_port at remote_IP/remote_mac Cheers, Henning From eagles051387 at gmail.com Tue Jan 26 04:00:34 2010 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Tue, 26 Jan 2010 12:00:34 +0000 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org> References: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org> Message-ID: does anyone have any benchmarks for I/O in a virtualized cluster? From tjrc at sanger.ac.uk Tue Jan 26 05:24:33 2010 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Tue, 26 Jan 2010 13:24:33 +0000 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: References: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org> Message-ID: On 26 Jan 2010, at 12:00 pm, Jonathan Aquilina wrote: > does anyone have any benchmarks for I/O in a virtualized cluster? I don't have formal benchmarks, but I can tell you what I see on my VMware virtual machines in general: Network I/O is reasonably fast - there's some additional latency, but nothing particularly severe. VMware can special-case communication between VMs on the same physical host, if required, but that reduces flexibility in moving the VMs around. Disk I/O is fairly poor, especially once the number of virtual machines becomes large. This is hardly surprising - the VMs are contending for shared resources, and there's bound to be more contention in a virtualised setup than in physical machines. In our case we have ~170 virtual machines running on 9 physical servers, each of which has dual GigE for VM traffic and dual-port fibrechannel. Forgive me for using VMware parlance rather than Xen, but hopefully the ideas will be the same. Here are a few things I've noted: 1) Applications with I/O patterns of large numbers of small disk operations are particularly painful (such as our ganglia server with all its thousands of tiny updates to RRD files).
We've mitigated this by configuring Linux on this guest to allow a much larger proportion of dirty pages than usual, and to not flush to disk quite so often. OK, so I risk losing more data if the VM goes pop, but this is just ganglia graphing, so I don't really care too much in that particular case. 2) Raw device maps (where you pass a LUN straight through to a single virtual machine, rather than carving the disk out of a datastore) reduce contention and increase performance somewhat, at the cost of using up device minor numbers on ESX quite quickly; because ESX is basically Linux, you're limited to 256 (I think - it might be 128) LUNs presented to each host, and probably to each cluster, since VMs need to be able to migrate. I basically use RDMs for database applications where the storage requirements are greater than about 500 GB. For less than that I use datastores. 3) Keep the number of virtual machines per datastore quite low, especially if the applications are I/O heavy, to reduce contention. 4) In an ideal world I'd spread the datastores over a larger number of RAID units than I currently have, but my budget can't stand that. All this is rather dependent of course on what technology you're using to provide storage to your virtual machines. We're using fibrechannel, but of course mileage may vary considerably if you use NAS or iSCSI, and depending on how many NICs you're bonding together to get bandwidth. From tjrc at sanger.ac.uk Tue Jan 26 06:24:20 2010 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Tue, 26 Jan 2010 14:24:20 +0000 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: References: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org> Message-ID: On 26 Jan 2010, at 1:24 pm, Tim Cutts wrote: > 2) Raw device maps (where you pass a LUN straight through to a > single virtual machine, rather than carving the disk out of a > datastore) reduce contention and increase performance somewhat, at > the cost of using up device minor numbers on ESX quite quickly; > because ESX is basically Linux, you're limited to 256 (I think - it > might be 128) LUNs presented to each host, and probably to each > cluster, since VMs need to be able to migrate. I basically use RDMs > for database applications where the storage requirements are greater > than about 500 GB. For less than that I use datastores. It's been pointed out to me that of course Linux supports a lot more than 256 devices presented. Nevertheless, for some reason ESX does not - presumably it's only using a single major number or something. Tim
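The dirty-page tuning Tim describes for the ganglia guest comes down to a handful of VM sysctls. The values below are illustrative rather than his actual settings; the trade-off, as he says, is losing more unflushed data if the guest dies:

# let more dirty pages accumulate before writeback kicks in
sysctl -w vm.dirty_background_ratio=20
sysctl -w vm.dirty_ratio=60
# let dirty data sit for 60s (default 30s) before it must be flushed
sysctl -w vm.dirty_expire_centisecs=6000
sysctl -w vm.dirty_writeback_centisecs=3000

To make them permanent, put the equivalent entries in /etc/sysctl.conf.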
From eagles051387 at gmail.com Tue Jan 26 06:38:11 2010 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Tue, 26 Jan 2010 15:38:11 +0100 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: References: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org> Message-ID: for starters to save on resources why not cut out the gui and go commandline to free up some more of the shared resources, and 2ndly wouldn't offloading data storage to a san or nfs storage server mitigate the disk I/O issues? i honestly don't know much about xen as i just got my hands dirty with it. wouldn't it be better than using software virtualization since xen takes advantage of the hardware virtualization that most modern processors come with? From lynesh at Cardiff.ac.uk Tue Jan 26 06:48:48 2010 From: lynesh at Cardiff.ac.uk (Huw Lynes) Date: Tue, 26 Jan 2010 14:48:48 +0000 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: References: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org> Message-ID: <1264517328.2282.47.camel@w609.insrv.cf.ac.uk> On Tue, 2010-01-26 at 13:24 +0000, Tim Cutts wrote: > 1) Applications with I/O patterns of large numbers of small disk > operations are particularly painful (such as our ganglia server with > all its thousands of tiny updates to RRD files). We've mitigated this > by configuring Linux on this guest to allow a much larger proportion > of dirty pages than usual, and to not flush to disk quite so often. > OK, so I risk losing more data if the VM goes pop, but this is just > ganglia graphing, so I don't really care too much in that particular > case. Ganglia thrashes disks even on physical hardware. So I'm not sure it's fair to lay this at the door of VMware. We run our ganglia on physical hardware and we still have to put the RRDs in a tmpfs partition to stop the disk I/O grinding the server down. Other than that your experience matches what I've seen with my ESX system (which we don't use for HPC). Thanks, Huw -- Huw Lynes | Advanced Research Computing HEC Sysadmin | Cardiff University | Redwood Building, Tel: +44 (0) 29208 70626 | King Edward VII Avenue, CF10 3NB From john.hearns at mclaren.com Tue Jan 26 07:24:19 2010 From: john.hearns at mclaren.com (Hearns, John) Date: Tue, 26 Jan 2010 15:24:19 -0000 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: References: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org> Message-ID: <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com> for starters to save on resources why not cut out the gui and go commandline to free up some more of the shared resources, and 2ndly wouldn't offloading data storage to a san or nfs storage server mitigate the disk I/O issues? i honestly don't know much about xen as i just got my hands dirty with it. wouldn't it be better than using software virtualization since xen takes advantage of the hardware virtualization that most modern processors come with? Jonathan, in a private reply I've already said that you should not be put off from having bright ideas! In no way wishing to rain on your parade - and indeed wishing you to experiment and keep asking questions, which you are very welcome to do, this has been thought of. Cluster nodes are commonly run without any GUI - commandline only, as you say. The debate comes around on this list every so often about running diskless!
The answer is yes, you can run diskless compute nodes, and I do. You boot them over the network, and have an NFS-root filesystem. On many clusters the application software is NFS mounted also. Your point about a SAN is very relevant - I would say that direct, physical fibrechannel SAN connections in a cluster are not common - simply due to the expense of installing the cards and a separate infrastructure. However, iSCSI is used and Infiniband is common in clusters. Apologies - I really don't want to come across as knowing better than you (which I don't). If we don't have people asking "what if" and "hey - here's a good idea" then you won't make anything new. From eagles051387 at gmail.com Tue Jan 26 08:34:21 2010 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Tue, 26 Jan 2010 17:34:21 +0100 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com> References: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org> <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com> Message-ID: john i thank you for the encouragement. it's better than what i get from certain people i deal with in ubuntu channels. you mention diskless booting using tftp and pxe. the problem though arises when you have a certain number of nodes accessing the same disk simultaneously, where disk I/O shoots through the roof. the only reason i know this is that i was helping a professor during the 2 yrs i was in college in the usa before transferring; we were creating a small cluster, 1 head node and 4 slaves, all diskless. it's a nice thing to have, but one thing that puts me off it is the idea of bogging down the drive. in the case of diskless, would it be better, then, to go to an SSD on the head node? From john.hearns at mclaren.com Tue Jan 26 08:37:48 2010 From: john.hearns at mclaren.com (Hearns, John) Date: Tue, 26 Jan 2010 16:37:48 -0000 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: <4B5F18BB.8000406@sas.upenn.edu> References: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org> <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com> <4B5F18BB.8000406@sas.upenn.edu> Message-ID: <68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com> > > Is it just me, or does HPC clustering and virtualization fall on > opposite ends of the spectrum? > Gavin, not necessarily. You could have a cluster of HPC compute nodes running a minimal base OS. Then install specific virtual machines with different OS/software stacks each time you run a job. OK, this is probably more relevant for grid or cloud computing - I first thought this would be a good idea when seeing that (at the time) the CERN LHC Grid software would only run with Redhat 7.2. So you could imagine 'packaging up' a virtual machine which has your particular OS flavour/libraries/compilers and shipping it out with the job. Another reason could be fault tolerance - you run VMs on the compute nodes.
When you detect that a hardware fault is coming along (eg from ECC errors or disk errors) you perform a live migration from one node to another - and your job keeps on trucking. (In theory; checkpointing needed etc. etc.) From mathog at caltech.edu Tue Jan 26 08:38:57 2010 From: mathog at caltech.edu (David Mathog) Date: Tue, 26 Jan 2010 08:38:57 -0800 Subject: [Beowulf] Logging MCE information on next warm boot? Message-ID: Henning Fehrmann wrote: > We loaded the netconsole module. This works at least for the > 2.6.27 kernel. AFAIK for older kernels one has to compile it into the kernel. Ah, good idea, and this distro already has that, but it isn't enabled by default. I see how to configure it and turn it on. Will a logger message for "kern" test it, or is there some other way to force a printk? I'm afraid the logger method might look like it is working, but just go through the usual syslog channels instead of netconsole. Thanks, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From mathog at caltech.edu Tue Jan 26 10:46:40 2010 From: mathog at caltech.edu (David Mathog) Date: Tue, 26 Jan 2010 10:46:40 -0800 Subject: [Beowulf] Logging MCE information on next warm boot? Message-ID: > David Mathog wrote: > Will a logger > message for "kern" test it, or is there some other way to force a > printk? I'm afraid the logger method might look like it is working, but > just go through the usual syslog channels instead of netconsole. Too optimistic. With netconsole (supposedly) running on the local node logger -p kern.err "test from me" only shows up in the log file on that node. No chance of confusion ;-). There is no explicit network logging of kern.err in /etc/syslog.conf, since I figured syslog is never going to be able to actually log anything after a kernel error. dmesg shows that netconsole started and thinks it is working: netconsole: local port 6666 netconsole: local IP 192.168.1.20 netconsole: interface eth0 netconsole: remote port 514 netconsole: remote IP 192.168.1.220 netconsole: remote ethernet address 00:30:48:59:f8:ff console [netcon0] enabled netconsole: network logging started However, absolutely nothing comes over netconsole when a node reboots. Searched a lot and finally found out how to test netconsole: [root at monkey20 rc6.d]# echo 'p' > /proc/sysrq-trigger [root at monkey20 rc6.d]# echo 't' > /proc/sysrq-trigger [root at monkey20 rc6.d]# echo 'm' > /proc/sysrq-trigger and it generated these on the syslogd machine: Jan 26 10:21:12 monkey20.cluster SysRq : Jan 26 10:21:12 monkey20.cluster Show Regs Jan 26 10:21:35 monkey20.cluster SysRq : Jan 26 10:21:35 monkey20.cluster Show State Jan 26 10:21:52 monkey20.cluster SysRq : Jan 26 10:21:52 monkey20.cluster Show Memory Notice the contentless messages, which were the same as on the video console. This is a log level issue; change it with dmesg, or [root at monkey20 rc6.d]# echo '9' > /proc/sysrq-trigger [root at monkey20 rc6.d]# echo 'm' > /proc/sysrq-trigger and then a pile of memory information shows up on both the syslog side and the video console. The default log level on these machines is 3. If the kernel panics with it set to that, will the messages that result be "contentless", like the ones above?
Thanks, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From vanallsburg at hope.edu Tue Jan 26 11:37:10 2010 From: vanallsburg at hope.edu (Paul Van Allsburg) Date: Tue, 26 Jan 2010 14:37:10 -0500 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: References: Message-ID: <4B5F4466.9020403@hope.edu> Ashley Pittman wrote: > On 25 Jan 2010, at 15:28, Jonathan Aquilina wrote: > > >> has anyone tried clustering using xen based vm's. what is everyone's take on that? its something that popped into my head while in my lectures today. >> > > I've been using Amazon ec2 for clustering for months now, from a software perspective it's very similar to running real hardware. For my needs (development) it's perfectly adequate, I've not benchmarked it against running the same code on the raw hardware though. > > Ashley, > > Hi Ashley, I'd love to try clustering on Amazon. Is there a good writeup somewhere on how to configure & use mpi in the cloud? Thanks! Paul -- Paul Van Allsburg Scientific Computing Specialist Natural Sciences Division, Hope College 35 E. 12th St. Holland, Michigan 49423 616-395-7292 vanallsburg at hope.edu http://www.hope.edu/academic/csm/ From eagles051387 at gmail.com Tue Jan 26 12:28:53 2010 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Tue, 26 Jan 2010 21:28:53 +0100 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: <4B5F4466.9020403@hope.edu> References: <4B5F4466.9020403@hope.edu> Message-ID: do you guys think that virtualized clustering is the future? From hahn at mcmaster.ca Tue Jan 26 15:18:40 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Tue, 26 Jan 2010 18:18:40 -0500 (EST) Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com> References: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org> <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com> <4B5F18BB.8000406@sas.upenn.edu> <68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com> Message-ID: >> Is it just me, or does HPC clustering and virtualization fall on >> opposite ends of the spectrum? depends on your definitions. virtualization certainly conflicts with those aspects of HPC which require bare-metal performance. even if you can reduce the overhead of virtualization, the question is why? look at the basic sort of HPC environment: compute nodes running a single distro, controlled by a scheduler. from the user's or job's perspective, there are just some nodes - which ones doesn't matter, or even how many in total. the user _should_ be able to assume that when they land on a node, it behaves as if freshly installed and booted de novo. we don't reboot nodes between jobs, of course, or even make much effort towards preventing a serial job from noticing other serial jobs on the same node (as containers would, let alone VMs). but we could, without tons of effort - it would just lower utilization. virtualization is about a few things: - improve utilization by coalescing low-duty-cycle services. - isolate services from each other - either to directly arbitrate runtime resource contention, or to disentangle configurations. - encapsulate all the state of a server so it can be moved. I think the first axis is quite non-HPC, since I don't think of HPC jobs as being like idle services.
(OTOH, many clusters have good utilization because multiple workloads get interleaved _above_ the processor level.) the second factor is not often an HPC problem, at least not in my experience, where J Random Fortran user doesn't really care that much about the environment (ie - they want f77 and lapack and empty queues). migration has some HPC appeal, since it permits defragmenting a cluster, as well as better preemption. > Gavin, not necessarily. You could have a cluster of HPC compute nodes > running a minimal base OS. > Then install specific virtual machines with different OS/software stacks > each time you run a job. or for each job, just install the provided OS image on the bare metal... your job's done, have it halt or reboot the node ;) > OK, this is probably more relevant for grid or cloud computing - I first grid and cloud computing are all part of the same game, no? along with massively parallel low-latency MPI, old-style vector supercomputing, GPU-assisted computing, throughput serial farming, etc. > thought this would be a good idea when seeing > that (at the time) the CERN LHC Grid software would only run with Redhat > 7.2 > So you could imagine 'packaging up' a virtual machine which has your > particular OS flavour/libraries/compilers and shipping > it out with the job. right, that's one of the axes of the problem-space: whether the app gets its own custom runtime environment (in the sense of kernel, libc, etc). another axis is the degree to which the app has to contend for resources (as in an overcommitted normal cluster, or a VM without guaranteed resources.) > Another reason could be fault tolerance - you run VMs on the compute > nodes. When you detect that a hardware fault is coming along > (eg from ECC errors or disk errors) you perform a live migration from > one node to another - and your job keeps on trucking. > (In theory; checkpointing needed etc. etc.) I'm pretty skeptical about this - the main issue with checkpointing is when there are external side-effects. checkpointing networked apps (including MPI) is hard because you have state "in flight", so can only freeze-dry the state by quiescing (letting the messages land, etc). the "live migration" demos I've seen have been apps that are tolerant to the loss of in-flight transactions (or which retry automatically). so I don't think virt is any kind of paradigm-changer, just like manycore merely stretches existing definitions. -mark From dag at sonsorol.org Tue Jan 26 16:02:57 2010 From: dag at sonsorol.org (Chris Dagdigian) Date: Tue, 26 Jan 2010 19:02:57 -0500 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: References: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org> <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com> <4B5F18BB.8000406@sas.upenn.edu> <68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com> Message-ID: <4B5F82B1.10805@sonsorol.org> One of the virtualization trends I do see in HPC/clustering is in the area of packaging up entire scientific applications into their own custom VMs which contain all the necessary libraries, software dependencies etc. There is a performance hit now and implementation is clunky, but I can see cases where "each app sits in its own VM" and the VMs get launched across a cluster would be helpful. This sort of work is trending upwards with Amazon AWS and other infrastructure providers - it can be easier to blast your workflow out into 'the cloud' if it's all wrapped up in a self contained and super portable VM.
Given how many different versions of R and other core tools like Perl etc. that I need to support on heterogeneous scientific clusters, this could be a good trend, heh. Just my $.02 -Chris From eagles051387 at gmail.com Tue Jan 26 23:16:48 2010 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Wed, 27 Jan 2010 08:16:48 +0100 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: <4B5F82B1.10805@sonsorol.org> References: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org> <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com> <4B5F18BB.8000406@sas.upenn.edu> <68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com> <4B5F82B1.10805@sonsorol.org> Message-ID: chris, not only is the vm portable - yes, you would take a hit - but from my research into xen it seems like the paid version of citrix xen server has some other nice features, such as migration to a backup machine in case of hardware failure. when you all say performance hit, how much of a hit are we talking about? also, if you guys are running a number of complex computations on bare metal, aren't you sharing resources that way as well? On Wed, Jan 27, 2010 at 1:02 AM, Chris Dagdigian wrote: > > One of the virtualization trends I do see in HPC/clustering is in the area > of packaging up entire scientific applications into their own custom VMs > which contain all the necessary libraries, software dependencies etc. > > There is a performance hit now and implementation is clunky, but I can see > cases where "each app sits in its own VM" and the VMs get launched across a > cluster would be helpful. > > This sort of work is trending upwards with Amazon AWS and other > infrastructure providers - it can be easier to blast your workflow out into > 'the cloud' if it's all wrapped up in a self contained and super portable > VM. > > Given how many different versions of R and other core tools like Perl etc. > that I need to support on heterogeneous scientific clusters this could be a > good trend, heh. > > Just my $.02 > > -Chris > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Jonathan Aquilina From carsten.aulbert at aei.mpg.de Tue Jan 26 23:33:24 2010 From: carsten.aulbert at aei.mpg.de (Carsten Aulbert) Date: Wed, 27 Jan 2010 08:33:24 +0100 Subject: [Beowulf] Logging MCE information on next warm boot? In-Reply-To: References: Message-ID: <201001270833.25086.carsten.aulbert@aei.mpg.de> Hi David, On Tuesday 26 January 2010 19:46:40 David Mathog wrote: > The default log level on these machines is 3. If the kernel panics with > it set to that, will the messages that result be "contentless", like the > ones above? Try dmesg -n 8 to raise the logging level and try echo '<7>David Test' > /dev/kmsg That should produce output: Jan 27 08:32:24 10.10.12.43 [3098843.050122] David Test The 7 is the logging "severity"; try that with <0> and you will send a message to everyone on the system. Does this help? cheers Carsten -- Dr.
Carsten Aulbert - Max Planck Institute for Gravitational Physics Callinstrasse 38, 30167 Hannover, Germany Phone/Fax: +49 511 762-17185 / -17193 http://www.top500.org/system/9234 | http://www.top500.org/connfam/6/list/3 CaCert Assurer | Get free certificates from http://www.cacert.org/ From henning.fehrmann at aei.mpg.de Wed Jan 27 00:29:58 2010 From: henning.fehrmann at aei.mpg.de (Henning Fehrmann) Date: Wed, 27 Jan 2010 09:29:58 +0100 Subject: [Beowulf] Logging MCE information on next warm boot? In-Reply-To: References: Message-ID: <20100127082958.GA3644@gretchen.aei.mpg.de> Hi David, On Tue, Jan 26, 2010 at 10:46:40AM -0800, David Mathog wrote: > > David Mathog wrote: > > Will a logger > > message for "kern" test it, or is there some other way to force a > > printk? I'm afraid the logger method might look like it is working, but > > just go through the usual syslog channels instead of netconsole. > > Too optimistic. With netconsole (supposedly) running on the local node Correct. > > logger -p kern.err "test from me" > > only shows up in the log file on that node. No chance of confusion ;-). > There is no explicit network logging of kern.err in /etc/syslog.conf, > since I figured syslog is never going to be able to actually log > anything after a kernel error. > > dmesg shows that netconsole started and thinks it is working: > > netconsole: local port 6666 > netconsole: local IP 192.168.1.20 > netconsole: interface eth0 > netconsole: remote port 514 > netconsole: remote IP 192.168.1.220 > netconsole: remote ethernet address 00:30:48:59:f8:ff > console [netcon0] enabled > netconsole: network logging started > > However, absolutely nothing comes over netconsole when a node reboots. > > Searched a lot and finally found out how to test netconsole: > > [root at monkey20 rc6.d]# echo 'p' > /proc/sysrq-trigger > [root at monkey20 rc6.d]# echo 't' > /proc/sysrq-trigger > [root at monkey20 rc6.d]# echo 'm' > /proc/sysrq-trigger > > and it generated these on the syslogd machine > > Jan 26 10:21:12 monkey20.cluster SysRq : > Jan 26 10:21:12 monkey20.cluster Show Regs > Jan 26 10:21:35 monkey20.cluster SysRq : > Jan 26 10:21:35 monkey20.cluster Show State > Jan 26 10:21:52 monkey20.cluster SysRq : > Jan 26 10:21:52 monkey20.cluster Show Memory > > Notice the contentless messages, which were the same as on the video > console. This is a log level issue, change it with dmesg or > > [root at monkey20 rc6.d]# echo '9' > /proc/sysrq-trigger > [root at monkey20 rc6.d]# echo 'm' > /proc/sysrq-trigger > > and then a pile of memory information shows up on both the syslog side > and the video console. > > The default log level on these machines is 3. If the kernel panics with > it set to that, will the messages that result be "contentless", like the > ones above? Hmmm, we had no kernel panics since we set up netconsole. I also don't know how much a NIC is affected by a panic. I tried to find something in the kernel source. At least the panic message has the log level KERN_EMERG so something should go through. I guess it is a matter of experience. I'd start with log level 7 which can be reduced any time. 
Cheers,
Henning

From john.hearns at mclaren.com Wed Jan 27 01:24:13 2010
From: john.hearns at mclaren.com (Hearns, John)
Date: Wed, 27 Jan 2010 09:24:13 -0000
Subject: [Beowulf] clustering using xen virtualized machines
In-Reply-To:
References: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org> <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com><4B5F18BB.8000406@sas.upenn.edu><68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com> <4B5F82B1.10805@sonsorol.org>
Message-ID: <68A57CCFD4005646957BD2D18E60667B0F12AF1B@milexchmb1.mil.tagmclarengroup.com>

HPCwire have a feature on HPC Cloud computing:
http://www.hpcwire.com/specialfeatures/cloud_computing/

The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy.

From chris at csamuel.org Mon Jan 25 16:48:50 2010
From: chris at csamuel.org (Chris Samuel)
Date: Tue, 26 Jan 2010 11:48:50 +1100
Subject: [Beowulf] Logging MCE information on next warm boot?
In-Reply-To:
References:
Message-ID: <201001261148.56168.chris@csamuel.org>

Hi David,

Apologies for the personal copy, but emails to the list from my new address are being moderated and I suspect the moderator is away at present.

On Tue, 26 Jan 2010 05:46:31 am David Mathog wrote:
> Is it possible to have the Machine Check Exception (MCE) information saved to disk automatically on the next warm boot?

Depending on your kernel version it may well do that by default; for instance both 2.6.20 and 2.6.28 (to pick at random from git) say:

/* Log the machine checks left over from the previous reset. This also clears all registers */
do_machine_check(NULL, mce_bootlog ? -1 : -2);

Greg mentions mcelog; well, that will write output to a file, but if that data doesn't make it to spinning rust before the machine locks up then you're out of luck, as it'll have cleared the MCE log as part of its action. :-(

There is parsemce by Dave Jones [1]; apparently you can run some of the parameters you get through it - for instance, for your error I get:

$ ./parsemce -e 0000000000000007 -b 2 -a 00000000001511C0 -s 940040000000017A
Status: (7) Machine Check in progress. Error IP valid Restart IP valid.
parsebank(2): 940040000000017a @ 1511c0
External tag parity error
Correctable ECC error
Address in addr register valid
Error enabled in control register
Memory heirarchy error
Request: Generic error
Transaction type : Generic
Memory/IO : I/O

IIRC that means that you took a machine check whilst there was already an MCE happening, and that becomes an uncorrectable error and the box will die.

[1] - http://www.codemonkey.org.uk/projects/parsemce/parsemce.c

If you can upgrade to a current kernel (2.6.3x) you can enable the new EDAC code, which will decode MCEs in the kernel and process/log them there. That might yield better information for you (and might even make it to a remote syslog if they don't make it to the local platters).

Best of luck!
Chris
-- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic. For more info see: http://en.wikipedia.org/wiki/OpenPGP
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 481 bytes
Desc: This is a digitally signed message part.
URL:

From pc7 at sanger.ac.uk Tue Jan 26 07:24:25 2010
From: pc7 at sanger.ac.uk (Peter Clapham)
Date: Tue, 26 Jan 2010 15:24:25 +0000
Subject: [Beowulf] clustering using xen virtualized machines
In-Reply-To:
References:
Message-ID: <4B5F0929.7080800@sanger.ac.uk>

On the AWS ec2 side, we've been performing a range of tests including full genome sequencing pipelines across varying numbers of nodes and storage. The biggest challenge to date has been IO, particularly if the smaller image systems are used. Where jobs are highly CPU bound and only lightly network (or, heaven forbid, disk) bound, things go reasonably well and have the potential to scale. Once IO becomes a factor the scaling decreases rapidly...

We've also had a run around with Xen. It requires more network fiddling to automate rollouts (at least in our environment) but it works OK, especially when paired with something like openQRM. It's a ways off being as polished as VMware, and some of the interesting memory handling doesn't appear to be all there. As a result, performance degrades fairly severely as the number of hosts and the load from IO-hungry apps increase. Regrettably I don't have enough useful data to present at present, and as always YMMV.

Pete

> I've been using Amazon ec2 for clustering for months now, from a software perspective it's very similar to running real hardware. For my needs (development) it's perfectly adequate, I've not benchmarked it against running the same code on the raw hardware though.
> Ashley,

-- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

From carlosf at cesga.es Tue Jan 26 08:25:31 2010
From: carlosf at cesga.es (Carlos Fernandez Sanchez)
Date: Tue, 26 Jan 2010 17:25:31 +0100
Subject: [Beowulf] clustering using xen virtualized machines
In-Reply-To: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org>
References: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org>
Message-ID: <040618AB773740DDA96583DAD8ED476B@pccarlosf2>

Another reference that might be worth looking at: Executing SGE Clusters on top of Hybrid Clouds using OpenNebula:
http://www.opennebula.org/lib/exe/fetch.php?id=outreach&cache=cache&media=constantino_vazquez_-_opennebula_-_executing_sge_clusters_on_top_of_hybrid_clouds_using_opennebula.ppt

Regards,
Carlos Fernandez Sanchez
Systems Manager CESGA
Avda. de Vigo s/n. Campus Sur
Tel.: (+34) 981569810, ext. 232
15705 - Santiago de Compostela SPAIN

--------------------------------------------------
From: "Douglas Eadline"
Sent: Tuesday, January 26, 2010 12:23 AM
To: "Jonathan Aquilina"
Cc: "Beowulf Mailing List"
Subject: Re: [Beowulf] clustering using xen virtualized machines

> You may want to look at this:
> Building A Virtual Cluster with Xen
> http://www.clustermonkey.net//content/view/139/33/
> -- Doug
>> has anyone tried clustering using xen based vm's. what is everyone's take on that? it's something that popped into my head while in my lectures today.
>> --
>> Jonathan Aquilina
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
> -- Doug
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From bug at sas.upenn.edu Tue Jan 26 08:30:51 2010
From: bug at sas.upenn.edu (Gavin Burris)
Date: Tue, 26 Jan 2010 11:30:51 -0500
Subject: [Beowulf] clustering using xen virtualized machines
In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com>
References: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org> <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com>
Message-ID: <4B5F18BB.8000406@sas.upenn.edu>

Is it just me, or do HPC clustering and virtualization fall on opposite ends of the spectrum?

With virtualization, you are pooling many virtual OS/server instances on high availability hardware, sharing memory and CPU as demanded, oversubscribing. What would be idle time on one server is utilized by another, loaded server.

With HPC clustering, you are running many physical OS/server instances that usually do not need to be highly available, but instead need to have direct access and total utilization of memory, CPU and storage. If queuing is done well, all servers are maxed out for performance under load.

With xen/vmware/amazon clusters, it seems that you would be adding the complexity and cost of a virtualization infrastructure, with few of the benefits that virtualization is targeted to provide.

Cheers.

On 01/26/2010 10:24 AM, Hearns, John wrote:
> for starters, to save on resources why not cut out the GUI and go command line to free up some more of the shared resources, and secondly wouldn't offloading data storage to a SAN or NFS storage server mitigate the disk I/O issues?
> I honestly don't know much about Xen as I just got my hands dirty with it. wouldn't it be better than using software virtualization, since Xen takes advantage of the hardware virtualization that most modern processors come with?
> Jonathan, in a private reply I've already said that you should not be put off from having bright ideas!
> In no way wishing to rain on your parade - and indeed wishing you to experiment and keep asking questions, which you are very welcome to do, this has been thought of.
> Cluster nodes are commonly run without a GUI - command line only, as you say. The debate comes around on this list every so often about running diskless! The answer is yes, you can run diskless compute nodes, and I do. You boot them over the network, and have an NFS-root filesystem. On many clusters the application software is NFS mounted also.
> Your point about a SAN is very relevant - I would say that direct, physical fibrechannel SAN connections in a cluster are not common - simply due to the expense of installing the cards and a separate infrastructure. However, iSCSI is used and Infiniband is common in clusters.
> Apologies - I really don't want to come across as knowing better than you (which I don't). If we don't have people asking "what if" and "hey - here's a good idea" then you won't make anything new.
> > > The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy.
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From Z.Wu at leeds.ac.uk Mon Jan 25 08:54:08 2010
From: Z.Wu at leeds.ac.uk (Zhili Wu)
Date: Mon, 25 Jan 2010 16:54:08 +0000
Subject: [Beowulf] CFP -- (SMLA 2010) Scalable Machine Learning and Applications Workshop
Message-ID: <15AB31C404448F4A918D077CA7E74C6E012B1989ADEE@HERMES7.ds.leeds.ac.uk>

[Please accept our apologies if you receive multiple copies of this email]

CALL FOR PAPERS

The 2010 International Workshop on Scalable Machine Learning and Applications (SMLA-10)
To be held in conjunction with CIT'10 (Supported by IEEE Computer Society), June 29 - July 1, 2010, Bradford, UK
http://smlc09.leeds.ac.uk/smla/
http://www.scim.brad.ac.uk/~ylwu/CIT2010/

SCOPE: Machine learning and data mining have been playing an increasing role in many real scenarios, such as web mining, language processing, image search and financial engineering. In these application domains, data are surpassing the terabyte scale at an ever-faster pace, but the techniques for processing and mining them often lag behind in far too many aspects. To deal with billions of web pages, images and transaction records, and with capacity-intensive audio and video data streams, machine learning and data mining techniques and their underlying computing infrastructure are facing great challenges. In this SMLA workshop we aim to bring together researchers and practitioners to advance scalable machine learning and its applications. On the one hand, we expect works on how to dramatically empower existing machine learning and data mining methods via grid/cloud or other novel computing models. On the other hand, we value the effort of building or extending machine learning and data mining methods that are scalable to huge datasets.

Papers can be related to any subset of the following topics, or any unconventional direction to scale up machine learning and data mining methods:
-- Cloud Computing
-- Large Scale Data Mining
-- Fast Support Vector Machines
-- Data Abstraction, Dimension Reduction
-- User Personalization and Recommendation
-- Natural Language Processing
-- Ontology and Semantic Technologies
-- Parallelization of Machine Learning Methods
-- Fast Machine Learning Model Tuning and Selection
-- Large Scale Webpage Topic, Genre, Sentiment Classification
-- Financial Engineering

STEERING COMMITTEE
Chih-Jen Lin, National Taiwan University, Taiwan
Serge Sharoff, University of Leeds, UK
Katja Markert, University of Leeds, UK
Ivor Wai-Hung Tsang, Nanyang Technological University, Singapore

PROGRAM CHAIRS
Zhili Wu, University of Leeds, UK
Xiaolong Jin, University of Bradford, UK

PUBLICITY CHAIRS
Evi Syukur, University of New South Wales, Australia
Lei Liu, University of Bradford, UK

PROGRAM COMMITTEE
Please refer to http://smlc09.leeds.ac.uk/smla/committee.htm for a complete list of the program committee.

PAPER SUBMISSION: Authors are invited to submit manuscripts reporting original unpublished research and recent developments in the topics related to the workshop.
The length of the papers should not exceed 6 pages + 2 pages for over-length charges (IEEE Computer Society Proceedings Manuscripts style: two columns, single-spaced), including figures and references, using a 10pt font, and number each page. Papers should be submitted electronically in PDF format (or PostScript) by sending them as an e-mail attachment to Zhili Wu (z.wu at leeds.ac.uk). All papers will be peer reviewed and the comments will be provided to the authors. The accepted papers will be published together with those of other CIT'10 workshops by the IEEE Computer Society Press.

***********************************************************************
Distinguished selected papers, after further extensions, will be published in CIT 2010's special issues of the following prestigious SCI-indexed journals:
-- The Journal of Supercomputing - Springer
-- Journal of Computer and System Sciences - Elsevier
-- Concurrency and Computation: Practice and Experience - John Wiley & Sons
***********************************************************************

IMPORTANT DATES:
Paper submission: February 15, 2010
Notification of Acceptance: April 01, 2010
Camera-ready due: April 18, 2010
Author registration: April 18, 2010
Conference: June 29 - July 1, 2010
***********************************************************************

From geoff at galitz.org Wed Jan 27 02:42:48 2010
From: geoff at galitz.org (Geoff Galitz)
Date: Wed, 27 Jan 2010 11:42:48 +0100
Subject: [Beowulf] clustering using xen virtualized machines
In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0F12AF1B@milexchmb1.mil.tagmclarengroup.com>
References: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org><68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com><4B5F18BB.8000406@sas.upenn.edu><68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com><4B5F82B1.10805@sonsorol.org> <68A57CCFD4005646957BD2D18E60667B0F12AF1B@milexchmb1.mil.tagmclarengroup.com>
Message-ID:

I've had the good fortune to be in the HPC and also the HA business for a few years (10 years for HPC but only about 4 for HA). Given the current approach to virtualization, I don't see that Xen or other virtualization technologies are good for HPC environments if performance is a paramount concern.

Virtualization in an HPC/HA world is mostly beneficial for portability and fail-over. But the added layer for a hypervisor will be significant if your jobs run for an extended period of time. I've seen jobs that run for months... a 7% performance penalty (fairly typical in my experience) over the course of a month is significant.

---------------------------------
Geoff Galitz
Blankenheim NRW, Germany
http://www.galitz.org/
http://german-way.com/blog/

From eagles051387 at gmail.com Wed Jan 27 04:08:25 2010
From: eagles051387 at gmail.com (Jonathan Aquilina)
Date: Wed, 27 Jan 2010 13:08:25 +0100
Subject: [Beowulf] clustering using xen virtualized machines
In-Reply-To:
References: <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com> <4B5F18BB.8000406@sas.upenn.edu> <68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com> <4B5F82B1.10805@sonsorol.org> <68A57CCFD4005646957BD2D18E60667B0F12AF1B@milexchmb1.mil.tagmclarengroup.com>
Message-ID:

Gavin, you mentioned costs; those are only incurred with Xen if you need the extra features, such as server migration. Also, if you don't need those extra features, couldn't you just live with the free version of Xen?
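For scale, taking Geoff's ~7% figure at face value: a job that runs for a 30-day month on bare metal, roughly 720 hours, would give up about 0.07 x 720 = ~50 hours of compute to the hypervisor.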
On Wed, Jan 27, 2010 at 11:42 AM, Geoff Galitz wrote: > > > I've had the good fortune to be in the HPC and also HA business for a few > years (10 years for HPC but only about 4 for HA). Given the current > approach for virtualization I don't see that Xen or other virtualization > technologies are good for HPC environments if the performance is a > paramount > concern. > > Virtualization in an HPC/HA world is mostly beneficial for portability and > fail-over. But the added layer for a hypervisor will be significant if > your > jobs run for an extended period of time. I've seen jobs that run for > months... a 7% performance penalty (fairly typical in my experience) over > the course of a month is significant. > > > > --------------------------------- > Geoff Galitz > Blankenheim NRW, Germany > http://www.galitz.org/ > http://german-way.com/blog/ > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Jonathan Aquilina -------------- next part -------------- An HTML attachment was scrubbed... URL: From bug at sas.upenn.edu Wed Jan 27 07:18:49 2010 From: bug at sas.upenn.edu (Gavin Burris) Date: Wed, 27 Jan 2010 10:18:49 -0500 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com> <4B5F18BB.8000406@sas.upenn.edu> <68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com> <4B5F82B1.10805@sonsorol.org> <68A57CCFD4005646957BD2D18E60667B0F12AF1B@milexchmb1.mil.tagmclarengroup.com> Message-ID: <4B605959.2000607@sas.upenn.edu> The cost for virtualization is in buying really big hardware, oodles of memory and many many cores, that are capable of running multiple VMs, and having that hardware configured for redundancy, high availability and failover. With an HPC cluster, you are typically buying hardware that is as stripped down and cheap as you can get it. You focus your HPC budget on the sweet-spot processor, the amount of memory, maybe GPUs, maybe interconnect, so you can deploy as many compute server nodes as you can afford. I don't buy the argument that the winning case is packaging up a VM with all your software. If you really are unable to build the required software stack for a given cluster and its OS, I think using something like xCAT to provision stateless compute servers per job is a better option than virtualization. And if you are packaging VMs to blast out to the cloud, I think you will be paying through the nose. This is not a viable option unless there is a major pricing shift. Cheers. On 01/27/2010 07:08 AM, Jonathan Aquilina wrote: > gavin you mentioned costs, those are only incurred with xen if you need > the extra features such as server migration and other features. also if > you dont need those extra features couldnt you just live with the free > version of xen. > > On Wed, Jan 27, 2010 at 11:42 AM, Geoff Galitz > wrote: > > > > I've had the good fortune to be in the HPC and also HA business for > a few > years (10 years for HPC but only about 4 for HA). Given the current > approach for virtualization I don't see that Xen or other virtualization > technologies are good for HPC environments if the performance is a > paramount > concern. > > Virtualization in an HPC/HA world is mostly beneficial for > portability and > fail-over. 
But the added layer for a hypervisor will be significant > if your > jobs run for an extended period of time. I've seen jobs that run for > months... a 7% performance penalty (fairly typical in my > experience) over > the course of a month is significant. > > > > --------------------------------- > Geoff Galitz > Blankenheim NRW, Germany > http://www.galitz.org/ > http://german-way.com/blog/ > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > > > > -- > Jonathan Aquilina > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mathog at caltech.edu Wed Jan 27 08:51:09 2010 From: mathog at caltech.edu (David Mathog) Date: Wed, 27 Jan 2010 08:51:09 -0800 Subject: [Beowulf] Re: Logging MCE information on next warm boot? Message-ID: Carsten Aulbert wrote: > > echo '<7>David Test' > /dev/kmsg > > That should produce output: > > Jan 27 08:32:24 10.10.12.43 [3098843.050122] David Test > > The 7 is the logging "severity" That was a good tip. Using the default dmesg setting and entering <0> -> <3> into the message string it logged across the network, <4> and up it did not. So netconsole seems to be working as expected. The mapping of numbers to types (error, emerg, etc.) seems not to be in the man files, or at least I have not found it there, but is in: /usr/include/sys/syslog.h and is #define LOG_EMERG 0 /* system is unusable */ #define LOG_ALERT 1 /* action must be taken immediately */ #define LOG_CRIT 2 /* critical conditions */ #define LOG_ERR 3 /* error conditions */ #define LOG_WARNING 4 /* warning conditions */ #define LOG_NOTICE 5 /* normal but significant condition */ #define LOG_INFO 6 /* informational */ #define LOG_DEBUG 7 /* debug-level messages */ thanks, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From eagles051387 at gmail.com Wed Jan 27 09:07:48 2010 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Wed, 27 Jan 2010 18:07:48 +0100 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: <4B605959.2000607@sas.upenn.edu> References: <4B5F18BB.8000406@sas.upenn.edu> <68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com> <4B5F82B1.10805@sonsorol.org> <68A57CCFD4005646957BD2D18E60667B0F12AF1B@milexchmb1.mil.tagmclarengroup.com> <4B605959.2000607@sas.upenn.edu> Message-ID: thanks for all yoru responses. i admit i dont have the money at the moment or a job to get my hands dirty with hpc. im planning in the future to setup a rendering cluster. i appreciate all the feed back here. im just wondering now would for instance a head node be of any use running virtualized guest os's or does the head node need to not share the hardware with other os's On Wed, Jan 27, 2010 at 4:18 PM, Gavin Burris wrote: > The cost for virtualization is in buying really big hardware, oodles of > memory and many many cores, that are capable of running multiple VMs, > and having that hardware configured for redundancy, high availability > and failover. > > With an HPC cluster, you are typically buying hardware that is as > stripped down and cheap as you can get it. 
You focus your HPC budget on > the sweet-spot processor, the amount of memory, maybe GPUs, maybe > interconnect, so you can deploy as many compute server nodes as you can > afford. > > I don't buy the argument that the winning case is packaging up a VM with > all your software. If you really are unable to build the required > software stack for a given cluster and its OS, I think using something > like xCAT to provision stateless compute servers per job is a better > option than virtualization. > > And if you are packaging VMs to blast out to the cloud, I think you will > be paying through the nose. This is not a viable option unless there is > a major pricing shift. > > Cheers. > > > On 01/27/2010 07:08 AM, Jonathan Aquilina wrote: > > gavin you mentioned costs, those are only incurred with xen if you need > > the extra features such as server migration and other features. also if > > you dont need those extra features couldnt you just live with the free > > version of xen. > > > > On Wed, Jan 27, 2010 at 11:42 AM, Geoff Galitz > > wrote: > > > > > > > > I've had the good fortune to be in the HPC and also HA business for > > a few > > years (10 years for HPC but only about 4 for HA). Given the current > > approach for virtualization I don't see that Xen or other > virtualization > > technologies are good for HPC environments if the performance is a > > paramount > > concern. > > > > Virtualization in an HPC/HA world is mostly beneficial for > > portability and > > fail-over. But the added layer for a hypervisor will be significant > > if your > > jobs run for an extended period of time. I've seen jobs that run for > > months... a 7% performance penalty (fairly typical in my > > experience) over > > the course of a month is significant. > > > > > > > > --------------------------------- > > Geoff Galitz > > Blankenheim NRW, Germany > > http://www.galitz.org/ > > http://german-way.com/blog/ > > > > > > > > _______________________________________________ > > Beowulf mailing list, Beowulf at beowulf.org > > sponsored by Penguin Computing > > To change your subscription (digest mode or unsubscribe) visit > > http://www.beowulf.org/mailman/listinfo/beowulf > > > > > > > > > > -- > > Jonathan Aquilina > > > > > > > > _______________________________________________ > > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Jonathan Aquilina -------------- next part -------------- An HTML attachment was scrubbed... URL: From jlforrest at berkeley.edu Wed Jan 27 09:31:49 2010 From: jlforrest at berkeley.edu (Jon Forrest) Date: Wed, 27 Jan 2010 09:31:49 -0800 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: References: <4B5F18BB.8000406@sas.upenn.edu> <68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com> <4B5F82B1.10805@sonsorol.org> <68A57CCFD4005646957BD2D18E60667B0F12AF1B@milexchmb1.mil.tagmclarengroup.com> <4B605959.2000607@sas.upenn.edu> Message-ID: <4B607885.1070709@berkeley.edu> At a recent Rocks clustering user's group meeting the recent addition of Rocks support of Xen-based virtual clusters came up. Some of the same questions recently raised on this list were discussed there. One justification for virtual clusters that I hadn't thought of was discussed. This only applies in places with large clusters run by a central computing group but used by various internal customers. 
Using virtual clusters makes it very easy to supply clusters to customers who need a cluster for a limited period of time. The amount of effort necessary to provision a new cluster is minimal. Nodes can easily and quickly be added, if necessary. This is as opposed to buying a new cluster for a research group, using it for a couple of months, and then turning it off. So, in this case, virtualized clusters have the advantage of being easier to manage. The performance overhead caused by the virtualization is a factor, but it's decreasing as time goes on due to better hardware support of virtualization and cleverer software. Cordially, -- Jon Forrest Research Computing Support College of Chemistry 173 Tan Hall University of California Berkeley Berkeley, CA 94720-1460 510-643-1032 jlforrest at berkeley.edu From eagles051387 at gmail.com Wed Jan 27 10:30:14 2010 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Wed, 27 Jan 2010 19:30:14 +0100 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: <4B607885.1070709@berkeley.edu> References: <4B5F82B1.10805@sonsorol.org> <68A57CCFD4005646957BD2D18E60667B0F12AF1B@milexchmb1.mil.tagmclarengroup.com> <4B605959.2000607@sas.upenn.edu> <4B607885.1070709@berkeley.edu> Message-ID: so basically what your saying is something along the lines of a rendering cluster would be a good candidate for this? On Wed, Jan 27, 2010 at 6:31 PM, Jon Forrest wrote: > At a recent Rocks clustering user's group > meeting the recent addition of Rocks support of > Xen-based virtual clusters came up. Some > of the same questions recently raised on this > list were discussed there. > > One justification for virtual clusters that I > hadn't thought of was discussed. This only applies > in places with large clusters run by a central > computing group but used by various internal > customers. Using virtual clusters makes it > very easy to supply clusters to customers > who need a cluster for a limited period of > time. The amount of effort necessary to > provision a new cluster is minimal. > Nodes can easily and quickly be added, > if necessary. This is as opposed to buying > a new cluster for a research group, using it > for a couple of months, and then turning it > off. > > So, in this case, virtualized clusters have > the advantage of being easier to manage. The > performance overhead caused by the virtualization > is a factor, but it's decreasing as time goes > on due to better hardware support of virtualization > and cleverer software. > > Cordially, > -- > Jon Forrest > Research Computing Support > College of Chemistry > 173 Tan Hall > University of California Berkeley > Berkeley, CA > 94720-1460 > 510-643-1032 > jlforrest at berkeley.edu > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Jonathan Aquilina -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From jlforrest at berkeley.edu Wed Jan 27 10:35:39 2010
From: jlforrest at berkeley.edu (Jon Forrest)
Date: Wed, 27 Jan 2010 10:35:39 -0800
Subject: [Beowulf] clustering using xen virtualized machines
In-Reply-To:
References: <4B5F82B1.10805@sonsorol.org> <68A57CCFD4005646957BD2D18E60667B0F12AF1B@milexchmb1.mil.tagmclarengroup.com> <4B605959.2000607@sas.upenn.edu> <4B607885.1070709@berkeley.edu>
Message-ID: <4B60877B.6060803@berkeley.edu>

On 1/27/2010 10:30 AM, Jonathan Aquilina wrote:
> so basically what you're saying is something along the lines of a rendering cluster would be a good candidate for this?

I'm saying nothing about how a virtualized cluster could or should be used. I'm only commenting about how a virtualized cluster might be easier to deal with from a central management point of view.

-- Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA 94720-1460
510-643-1032
jlforrest at berkeley.edu

From hahn at mcmaster.ca Thu Jan 28 07:10:20 2010
From: hahn at mcmaster.ca (Mark Hahn)
Date: Thu, 28 Jan 2010 10:10:20 -0500 (EST)
Subject: [Beowulf] clustering using xen virtualized machines
In-Reply-To: <4B605959.2000607@sas.upenn.edu>
References: <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com> <4B5F18BB.8000406@sas.upenn.edu> <68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com> <4B5F82B1.10805@sonsorol.org> <68A57CCFD4005646957BD2D18E60667B0F12AF1B@milexchmb1.mil.tagmclarengroup.com> <4B605959.2000607@sas.upenn.edu>
Message-ID:

> I don't buy the argument that the winning case is packaging up a VM with all your software. If you really are unable to build the required software stack for a given cluster and its OS, I think using something

you're right, but only for narrow-function clusters. suppose you have a cluster used by 2k users across a handful of different universities and 100 departments. and have, let's say, 2 staff. it's conceivable that using VMs would permit a higher level of service by putting more configuration flexibility into the hands of the users. yes, most would use a standard image (which might be the bare-metal one, actually), but making it easier to accommodate variance is valuable.

it even offers the ability to shift the model - instead of actually booting VMs on nodes for a job, how about just resurrecting a number of VM instances (freeze-dried in already-booted state)? that makes the setup latency potentially much lower. (pages from a VM image can be fetched lazily afaik, and presumably also COW.)

for the few HPC-oriented performance studies of VMs I've seen, the only slowdowns were for OS activity (IO, page allocation, etc). an ideally-behaved HPC app minimizes those already, so...

From hahn at mcmaster.ca Thu Jan 28 07:17:25 2010
From: hahn at mcmaster.ca (Mark Hahn)
Date: Thu, 28 Jan 2010 10:17:25 -0500 (EST)
Subject: [Beowulf] clustering using xen virtualized machines
In-Reply-To:
References: <4B5F18BB.8000406@sas.upenn.edu> <68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com> <4B5F82B1.10805@sonsorol.org> <68A57CCFD4005646957BD2D18E60667B0F12AF1B@milexchmb1.mil.tagmclarengroup.com>
Message-ID:

> im just wondering now would for instance a head node be of any use running virtualized guest os's or does the head node need to not share the hardware with other os's

well, the HA-ish motive for VMs has some application to the admin portions of even a pure HPC cluster.
for instance, your jobs may execute on bare metal, but there is some appeal to putting various cluster support services into their own VMs. for instance, most clusters need DHCP and TFTP (eg for PXE) - but that's a fairly lightweight service that could be VM'ed. you'd lose a bit of performance, but gain the ability to switch physical hosts, can still share physical hosts with other services, and insulate the service from random insult like OS/library upgrades. you can always roll back to a known-good config. this is not a huge breakthrough, since such services are not all that fragile in the first place. in a sense, part of the value-add of using a VM is encapsulating a bunch of system settings in a way that's otherwise spread across multiple files. regards, mark hahn. From tjrc at sanger.ac.uk Thu Jan 28 08:14:23 2010 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Thu, 28 Jan 2010 16:14:23 +0000 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com> <4B5F18BB.8000406@sas.upenn.edu> <68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com> <4B5F82B1.10805@sonsorol.org> <68A57CCFD4005646957BD2D18E60667B0F12AF1B@milexchmb1.mil.tagmclarengroup.com> <4B605959.2000607@sas.upenn.edu> Message-ID: On 28 Jan 2010, at 3:10 pm, Mark Hahn wrote: >> I don't buy the argument that the winning case is packaging up a VM >> with >> all your software. If you really are unable to build the required >> software stack for a given cluster and its OS, I think using >> something > > you're right, but only for narrow-function clusters. suppose you > have a cluster used by 2k users across a handful of different > universities > and 100 departments. and have, let's say, 2 staff. it's conceivable > that using VMs would permit a higher level of service by putting > more configuration flexibility into the hands of the users. yes, > most would > use a standard image (which might be the bare-metal one, actually), > but making it easier to accommodate variance is valuable. > > it even offers the ability to shift the model - instead of actually > booting VMs on nodes for a job, how about just resurrecting a number > of VM instances (freeze-dried in already-booted state)? that makes > the setup latency potentially much lower. (pages from a VM image can > be fetched lazily afaik, and presumably also COW.) COW is certainly how some of the virtual desktop solutions work; desktop machines are 90% identical in most organisations, so it makes sense to use COW when firing up a new one. So the technology is definitely around. > for the few HPC-oriented performance studies of VMs I've seen, > the only slowdowns were for OS activity (IO, page allocation, etc). > an ideally-behaved HPC app minimizes those already, so... We've certainly seen some interesting behaviour as far as the network is concerned. We tried creating a VM with a Lustre client in it, and have not had much success with that. There's more variability in network latency, and the Lustre servers hate that and keep ejecting the client. We haven't solved the problem yet. Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. 
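To make the service-VM idea a couple of messages up concrete, here is a minimal sketch of what such a guest could look like under Xen 3.x xm tooling; every name, path, volume and MAC address below is hypothetical, and a paravirtualized guest kernel plus an existing bridge are assumed:

# write a tiny domU config holding just the dhcpd/tftpd service
cat > /etc/xen/pxe-services.cfg <<'EOF'
name    = "pxe-services"
kernel  = "/boot/vmlinuz-2.6.18-xen"
ramdisk = "/boot/initrd-2.6.18-xen.img"
memory  = 256
vcpus   = 1
disk    = ['phy:/dev/vg0/pxe-services,xvda,w']
vif     = ['mac=00:16:3e:00:00:01, bridge=xenbr0']
root    = "/dev/xvda ro"
EOF

# boot it on whichever dom0 is convenient
xm create pxe-services.cfg

Because the whole service lives in one small image, it can be moved to another physical host or rolled back to a known-good state without touching the compute nodes it serves.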
From bug at sas.upenn.edu Thu Jan 28 08:23:30 2010
From: bug at sas.upenn.edu (Gavin Burris)
Date: Thu, 28 Jan 2010 11:23:30 -0500
Subject: [Beowulf] clustering using xen virtualized machines
In-Reply-To:
References: <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com> <4B5F18BB.8000406@sas.upenn.edu> <68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com> <4B5F82B1.10805@sonsorol.org> <68A57CCFD4005646957BD2D18E60667B0F12AF1B@milexchmb1.mil.tagmclarengroup.com> <4B605959.2000607@sas.upenn.edu>
Message-ID: <4B61BA02.9000303@sas.upenn.edu>

Two staff couldn't handle 2k users and 100 departments, or that much hardware. Answering tickets or emails alone would be overwhelming. Building/maintaining the VMs, or training/documenting/helping the departments to build their own VMs, is a monumental task in and of itself. A more realistic number is 1 FTE per 4 HPC-using departments.

I would wager that generalizing and not targeting any particular performance aspect will only cause the departments to pool their own money and build their own targeted resource, for less money, with a grad student and an O'Reilly book. I find that most users only have time for their application workflow or domain-specific coding, not to be system programmers making VMs.

Sorry, I'm not drinking the virtualization/cloud koolaid. I'd love to have everything abstracted and easy to manage, but I find standardizing on an OS or two and keeping things as stock as possible is easier, and cheaper to manage at this point. In my situation, virtualization just adds complexity and has a price/performance penalty.

Cheers.

On 01/28/2010 10:10 AM, Mark Hahn wrote:
>> I don't buy the argument that the winning case is packaging up a VM with all your software. If you really are unable to build the required software stack for a given cluster and its OS, I think using something
> you're right, but only for narrow-function clusters. suppose you have a cluster used by 2k users across a handful of different universities and 100 departments. and have, let's say, 2 staff. it's conceivable that using VMs would permit a higher level of service by putting more configuration flexibility into the hands of the users. yes, most would use a standard image (which might be the bare-metal one, actually), but making it easier to accommodate variance is valuable.
> it even offers the ability to shift the model - instead of actually booting VMs on nodes for a job, how about just resurrecting a number of VM instances (freeze-dried in already-booted state)? that makes the setup latency potentially much lower. (pages from a VM image can be fetched lazily afaik, and presumably also COW.)
> for the few HPC-oriented performance studies of VMs I've seen, the only slowdowns were for OS activity (IO, page allocation, etc). an ideally-behaved HPC app minimizes those already, so...
> From hahn at mcmaster.ca Thu Jan 28 08:40:52 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 28 Jan 2010 11:40:52 -0500 (EST) Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: <4B61BA02.9000303@sas.upenn.edu> References: <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com> <4B5F18BB.8000406@sas.upenn.edu> <68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com> <4B5F82B1.10805@sonsorol.org> <68A57CCFD4005646957BD2D18E60667B0F12AF1B@milexchmb1.mil.tagmclarengroup.com> <4B605959.2000607@sas.upenn.edu> <4B61BA02.9000303@sas.upenn.edu> Message-ID: > Two staff couldn't handle 2k users and 100 of departments, or that much > hardware. so you say. my example is a scaled down version of actual numbers of my organization. of course, much depends on how you define "user" (logged in right now or "getent passwd | wc -l"?) or for that matter how you define staff. > Answering tickets or emails alone would be overwhelming. > Building/maintaining the VMs, or training/document/helping the > departments to build their own VMs is a monumental task in and of > itself. A more realistic number is 1 FTE per 4 hpc-using departments. hah. good for you! > I would wager that generalizing and not targeting any particular > performance aspect will only cause the departments to pool their own > money and build their own targeted resource, for less money, with a grad > student and an oreilly book. there's always some tension between such approaches. but most HPC PIs quickly learn that it hurts their research when they lose grad student time to cluster admin. the time consumed by cluster admin has a large constant factor, and very weak size scaling. > I find that most users only have time for > their application workflow or domain-specific coding, not to be system > programmers making VMs. I'm not sure why you assume such an approach would expect users to build VMs from scratch. customizing a working vm is in principle no harder in principle than submitting a job. > Sorry, I'm not drinking the virtualization/cloud koolaid. I'd love to I should say here that I'm not either - it's just a minor extension of the existing tools. clearly more significant than grids and potentially more useful and general. From tjrc at sanger.ac.uk Thu Jan 28 09:34:25 2010 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Thu, 28 Jan 2010 17:34:25 +0000 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: <4B61BA02.9000303@sas.upenn.edu> References: <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com> <4B5F18BB.8000406@sas.upenn.edu> <68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com> <4B5F82B1.10805@sonsorol.org> <68A57CCFD4005646957BD2D18E60667B0F12AF1B@milexchmb1.mil.tagmclarengroup.com> <4B605959.2000607@sas.upenn.edu> <4B61BA02.9000303@sas.upenn.edu> Message-ID: On 28 Jan 2010, at 4:23 pm, Gavin Burris wrote: > Sorry, I'm not drinking the virtualization/cloud koolaid. I'd love to > have everything abstracted and easy to manage, but I find > standardizing > on an OS or two and keeping things as stock as possible is easier, and > cheaper to manage at this point. In my situation, virtualization just > adds complexity and has a price/performance penalty. For HPC, I think you're probably right at the moment. But for more run-of-the-mill servers, the cost/benefit and simplification is definitely good. 
I've got about 9 physical servers providing virtual machines for web servers, development nodes, infrastructure services (mail, and so on) and all that stuff. Those 9 servers are running 170 virtual machines, which in the old days would mostly have been separate boxes, and in many cases redundant pairs for failover. OK, so the servers are meatier, and cost maybe four times what the basic tin we'd have used would have cost for a single server. I still make that about 75% less money on hardware than we would otherwise have spent for that number of services. The saving is much larger than what it cost to buy vSphere. Power consumption is more like a 90% saving - my entire virtualisation setup consumes about 3.6kW (not counting the storage), which is, what, about 20W per VM (and we're not full yet). And it all sits in 9U of rack space. And then there are all the fringe benefits which save me time; simplified storage allocation, reduced deployment time, almost complete elimination of service downtime for hardware maintenance, guest OS patch management and automated remediation (for Windows, SLES and RHAS anyway - most of our machines run Debian which sadly they don't do patch management for). I get HA for free, so I no longer have to fart about with heartbeat and redundant server pairs. I get lock-step fault tolerance for free, too, if I need it, so I can finally get rid of that Marathon abomination. Backups become simpler (meh, just back up the whole VM with Consolidated Backup). You're still right that the management of VM setup takes quite a lot of time, but it's a lot less than if I were having to configure and deploy the same wide variety of services on physical hardware. But Cloud stuff, I'm right with you and slightly skeptical at the moment. Especially for our extremely data-heavy CPU-lite applications. It's more likely to have application in our line of work for the ability to ship arbitrary untrusted code to data. My dream world is for all the sequencing sites to present their data to a cloud interface in a consistent manner, and if I want to analyse, say, Broad's data, I just ship my VM to them and run my analysis there. Similarly, we provide hosts for running their VMs. No more shipping disks around by Fedex, which is what the scientists currently do. But it's probably never going to happen. *sigh*. Far too much "not invented here" politics. Regards, Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From jlforrest at berkeley.edu Thu Jan 28 09:38:14 2010 From: jlforrest at berkeley.edu (Jon Forrest) Date: Thu, 28 Jan 2010 09:38:14 -0800 Subject: [Beowulf] GPU Beowulf Clusters Message-ID: <4B61CB86.9060105@berkeley.edu> I'm about to spend ~$20K on a new cluster that will be a proof-of-concept for doing GPU-based computing in one of the research groups here. A GPU cluster is different from a traditional HPC cluster in several ways: 1) The CPU speed and number of cores are not that important because most of the computing will be done inside the GPU. 2) Serious GPU boards are large enough that they don't easily fit into standard 1U pizza boxes. Plus, they require more power than the standard power supplies in such boxes can provide. I'm not familiar with the boxes that therefore should be used in a GPU cluster. 
3) Ideally, I'd like to put more than one GPU card in each compute node, but then I hit the issues in #2 even harder.

4) Assuming that a GPU can't be "time shared", this means that I'll have to set up my batch engine to treat the GPU as a non-sharable resource. This means that I'll only be able to run as many jobs on a compute node as I have GPUs. This also means that it would be wasteful to put CPUs in a compute node with more cores than the number of GPUs in the node. (This is assuming that the jobs don't do anything parallel on the CPUs - only on the GPUs.) Even if GPUs can be time shared, given the expense of copying between main memory and GPU memory, sharing GPUs among several processes will degrade performance.

Are there any other issues I'm leaving out?

Cordially,
-- Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA 94720-1460
510-643-1032
jlforrest at berkeley.edu

From eagles051387 at gmail.com Thu Jan 28 09:50:16 2010
From: eagles051387 at gmail.com (Jonathan Aquilina)
Date: Thu, 28 Jan 2010 18:50:16 +0100
Subject: [Beowulf] GPU Beowulf Clusters
In-Reply-To: <4B61CB86.9060105@berkeley.edu>
References: <4B61CB86.9060105@berkeley.edu>
Message-ID:

Are you going for the Nvidia Teslas, or are you looking to squeeze four cards into one box? Getting them powered shouldn't be a problem if you plan on using plain custom-built desktops; there are 2000W PSUs out there, if not more, nowadays. I'm not sure, though, whether you can quad-SLI the Teslas, and whether SLI would make any difference in regards to GPU clustered computing.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mdidomenico4 at gmail.com Thu Jan 28 09:53:42 2010
From: mdidomenico4 at gmail.com (Michael Di Domenico)
Date: Thu, 28 Jan 2010 12:53:42 -0500
Subject: [Beowulf] GPU Beowulf Clusters
In-Reply-To: <4B61CB86.9060105@berkeley.edu>
References: <4B61CB86.9060105@berkeley.edu>
Message-ID:

The way I do it is, but your mileage may vary...

We allocate two CPUs per GPU and use the Nvidia Tesla S1070 1U chassis product, so a standard quad-core, dual-socket server with four GPUs attached.

We've found that even though you expect the GPU to do most of the work, it really takes a CPU to drive the GPU and keep it busy. Having a second CPU to pre-stage/post-stage the memory has worked pretty well also.

For scheduling, we use SLURM and allocate one entire node per job, no sharing.

On Thu, Jan 28, 2010 at 12:38 PM, Jon Forrest wrote:
> I'm about to spend ~$20K on a new cluster that will be a proof-of-concept for doing GPU-based computing in one of the research groups here.
> A GPU cluster is different from a traditional HPC cluster in several ways:
> 1) The CPU speed and number of cores are not that important because most of the computing will be done inside the GPU.
> 2) Serious GPU boards are large enough that they don't easily fit into standard 1U pizza boxes. Plus, they require more power than the standard power supplies in such boxes can provide. I'm not familiar with the boxes that therefore should be used in a GPU cluster.
> 3) Ideally, I'd like to put more than one GPU card in each compute node, but then I hit the issues in #2 even harder.
> 4) Assuming that a GPU can't be "time shared", this means that I'll have to set up my batch engine to treat the GPU as a non-sharable resource. This means that I'll only be able to run as many jobs on a compute node as I have GPUs.
This also means > that it would be wasteful to put CPUs in a compute > node with more cores than the number GPUs in the > node. (This is assuming that the jobs don't do > anything parallel on the CPUs - only on the GPUs). > Even if GPUs can be time shared, given the expense > of copying between main memory and GPU memory, > sharing GPUs among several processes will degrade > performance. > > Are there any other issues I'm leaving out? > > Cordially, > -- > Jon Forrest > Research Computing Support > College of Chemistry > 173 Tan Hall > University of California Berkeley > Berkeley, CA > 94720-1460 > 510-643-1032 > jlforrest at berkeley.edu > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From mathog at caltech.edu Thu Jan 28 12:11:54 2010 From: mathog at caltech.edu (David Mathog) Date: Thu, 28 Jan 2010 12:11:54 -0800 Subject: [Beowulf] Re: GPU Beowulf Clusters Message-ID: Jon Forrest wrote: > Are there any other issues I'm leaving out? Yes, the time and expense of rewriting your code from a CPU model to a GPU model, and the learning curve for picking up this new skill. (Unless you are lucky and somebody has already ported the software you use.) Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From dnlombar at ichips.intel.com Thu Jan 28 14:40:15 2010 From: dnlombar at ichips.intel.com (David N. Lombard) Date: Thu, 28 Jan 2010 14:40:15 -0800 Subject: [Beowulf] rhel page_size In-Reply-To: References: Message-ID: <20100128224015.GB6191@nlxdcldnl2.cl.intel.com> On Fri, Jan 22, 2010 at 11:33:08AM -0700, Michael Di Domenico wrote: > does anyone know if it's still possible to change the default > page_size from 4k to something larger on RHEL v5 x86_64? > > My efforts to recompile the kernel with a larger page size are failing > me, but i might be doing it wrong... Google for HugeTLBfs, it will be much easier than changing the kernel's page size. -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. From peter.st.john at gmail.com Fri Jan 29 07:51:21 2010 From: peter.st.john at gmail.com (Peter St. John) Date: Fri, 29 Jan 2010 10:51:21 -0500 Subject: [Beowulf] rhel page_size In-Reply-To: References: Message-ID: I asked a guy at Red Hat, who asked a guy....(reminding me of the chain of Djinn in "Godel Escher Bach") and got this reply: "Hugepages are enabled on RHEL5 by default, so he's welcome to use those. Is there any particular reason he's not using those, and trying a recompile instead?" Peter On Fri, Jan 22, 2010 at 1:33 PM, Michael Di Domenico wrote: > does anyone know if it's still possible to change the default > page_size from 4k to something larger on RHEL v5 x86_64? > > My efforts to recompile the kernel with a larger page size are failing > me, but i might be doing it wrong... > > thanks > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ashley at pittman.co.uk Fri Jan 29 13:55:04 2010 From: ashley at pittman.co.uk (Ashley Pittman) Date: Fri, 29 Jan 2010 21:55:04 +0000 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: <4B5F4466.9020403@hope.edu> References: <4B5F4466.9020403@hope.edu> Message-ID: <0517518E-4DD3-4F0C-A75F-81A42206F0F4@pittman.co.uk> On 26 Jan 2010, at 19:37, Paul Van Allsburg wrote: > Ashley Pittman wrote: >> On 25 Jan 2010, at 15:28, Jonathan Aquilina wrote: >>> has anyone tried clustering using xen based vm's. what is everyones take on that? its something that popped into my head while in my lectures today. >>> >> >> I've been using Amazon ec2 for clustering for months now, from a software perspective it's very similar to running real hardware. For my needs (development) it's perfectly adequate, I've not benchmarked it against running the same code on the raw hardware though. > > I'd love to try clustering on Amazon. It's really easy. > Is there a good writeup somewhere on how to configure & use mpi in the cloud? I'm not sure one is needed. As a bit of background I develop and support an open source debugging tool for parallel applications (see my sig for details), as such I run a lot of parallel apps but I run them purely to have something to test padb against hence I'm not bothered about performance, I just need a running job to interrogate. What is important for me (or rather my tool) is that it works in different environments so I run with a variety of clustering software. With Amazon I can boot any numbers of machine "instances" and pay $0.85c/h for each one, typically I run four at a time but I've run with up to twenty. Once the instances are booted there is no difference between using them and using real machines. I regularly use Slurm, OpenMPI (ORTE and under Slurm), MPICH2 (mpd, hydra and under slurm) and I've yet to find any way in which the setup differs from running on real metal. For persistent storage I pay for a 'EBS' volume which I attach to one vm and nfs export to the others which use as a shared /home, each instance also comes with a large scratch partition but I typically don't use this at all. I have a bunch of scripts for populating the hosts files and adding user accounts and that's all there is to it. For the EBS volume you simply pick the size you need, create the volume, attach it to a vm and them mkfs.ext3 as normal, this volume is persistent and is charged for by Gb by calendar month rather than instance hour. I can also choose what distro and indeed OS to run, the default is FC8 but it's easy enough to pick something else, I tend to flip between FC8, debian and Solaris every few weeks, this is mostly to ensure my code is well tested in different machines - it does mean re-compiling everything each time I switch which can take a while. I also noticed that over-committing virtual machines doesn't have the same negative impact as over-commiting the CPU's on virtual machines, sure the application performance plummets in either case but the virtual machine is still usable where as a real machine can stop responding almost completely. This means I can over-commit my vm's by running 32 procs per node and run 512 process jobs at a cost of only $1.36 an hour. Cheap enough to be able to try something, see if it works and not have to worry about the cost. In short, Amazon makes a really good development or test system for small scale clusters, it's good for testing code correctness and experimenting with different distos. 
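A rough sketch of that EBS-plus-NFS arrangement using the classic EC2 API tools; the volume and instance IDs, the zone and the subnet here are all placeholders:

# create a persistent 20 GB volume and attach it to the instance that will serve /home
ec2-create-volume -s 20 -z us-east-1a
ec2-attach-volume vol-12345678 -i i-87654321 -d /dev/sdf

# on that instance: filesystem, mount, and NFS-export as the shared /home
mkfs.ext3 /dev/sdf
mount /dev/sdf /home
echo '/home 10.0.0.0/24(rw,sync,no_root_squash)' >> /etc/exports
exportfs -ra

Since the volume is billed per GB per calendar month rather than per instance hour, it can persist as /home while the compute instances come and go.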
I'm not convinced about the performance, and I'm not convinced about the cost-effectiveness for larger or longer-running applications, but as a place to start it's ideal.

Ashley,

--

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

From jeff.johnson at aeoncomputing.com Thu Jan 28 12:27:06 2010
From: jeff.johnson at aeoncomputing.com (Jeff Johnson)
Date: Thu, 28 Jan 2010 12:27:06 -0800
Subject: [Beowulf] Re: GPU Beowulf Clusters (Jon Forrest)
In-Reply-To: <201001282000.o0SK087c015636@bluewest.scyld.com>
References: <201001282000.o0SK087c015636@bluewest.scyld.com>
Message-ID: <4B61F31A.7050109@aeoncomputing.com>

On 1/28/10 12:00 PM, Jon Forrest wrote:
> A GPU cluster is different from a traditional
> HPC cluster in several ways:
>
> 1) The CPU speed and number of cores are not that important because
> most of the computing will be done inside the GPU.

The GPU will be doing the specific operations called in the application, but you need enough CPU to handle the memory operations and PCI I/O that feed the GPU.

> 2) Serious GPU boards are large enough that they don't easily fit into
> standard 1U pizza boxes. Plus, they require more power than the
> standard power supplies in such boxes can provide. I'm not familiar
> with the boxes that therefore should be used in a GPU cluster.

There are 1U system designs that use a passively cooled GPU relying on the chassis cooling infrastructure, as the CPUs do, and they work well. They are matched with the correct power supply size to support the GPU as well as the CPUs, memory, disk, etc.

> 3) Ideally, I'd like to put more than one GPU card in each computer
> node, but then I hit the issues in #2 even harder.

Not in a 1U system, unless you use the nVidia S1070 external GPU chassis. Even then, if your application can be bottlenecked by having less than full PCIe x16 bandwidth to the GPUs, then the S1070 approach would be less than optimal compared to a system that has two dedicated, full-speed PCIe x16 slots.

--Jeff

--
------------------------------
Jeff Johnson
Manager
Aeon Computing

jeff.johnson at aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 f: 858-412-3845
m: 619-204-9061

4905 Morena Boulevard, Suite 1313 - San Diego, CA 92117

From tvsingh at ucla.edu Thu Jan 28 12:57:05 2010
From: tvsingh at ucla.edu (Singh, Tajendra)
Date: Thu, 28 Jan 2010 12:57:05 -0800
Subject: [Beowulf] GPU Beowulf Clusters
In-Reply-To: 
References: <4B61CB86.9060105@berkeley.edu>
Message-ID: <43F64E86355A744E9D51506B6C6783B906041C18@EM2.ad.ucla.edu>

This is not a problem in your setup, as you are assigning a whole node together. In general, how can one deal with the problem of binding a particular GPU device to the scheduler? Sorry if I am asking something which is already known and there are ways to bind the devices within the scheduler.

Thanks,
TV

-----Original Message-----
From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Michael Di Domenico
Sent: Thursday, January 28, 2010 9:54 AM
To: Beowulf Mailing List
Subject: Re: [Beowulf] GPU Beowulf Clusters

The way I do it is, but your mileage may vary...

We allocate two CPUs per GPU and use the Nvidia Tesla S1070 1U chassis product. So a standard quad-core, dual-socket server with four GPUs attached.

We've found that even though you expect the GPU to do most of the work, it really takes a CPU to drive the GPU and keep it busy. Having a second CPU to pre-stage/post-stage the memory has worked pretty well also.
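Roughly, the staging pattern looks like this (a hypothetical sketch, not from any real code base - kernel and buffer names are invented). Pinned memory via cudaMallocHost is what lets cudaMemcpyAsync actually overlap with CPU work; real code would double-buffer to overlap with the kernel as well:

    #include <string.h>
    #include <cuda_runtime.h>

    __global__ void process(float *d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;   /* stand-in for the real work */
    }

    void run_chunks(const float *src, int nchunks, int n)
    {
        float *h_buf, *d_buf;
        cudaStream_t stream;
        cudaMallocHost((void **)&h_buf, n * sizeof(float)); /* pinned */
        cudaMalloc((void **)&d_buf, n * sizeof(float));
        cudaStreamCreate(&stream);

        for (int i = 0; i < nchunks; i++) {
            /* one CPU core stages the next chunk into pinned memory... */
            memcpy(h_buf, src + (size_t)i * n, n * sizeof(float));
            /* ...then queues the copy and kernel and is free to do other work */
            cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float),
                            cudaMemcpyHostToDevice, stream);
            process<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
            cudaStreamSynchronize(stream);  /* wait before reusing h_buf */
        }
        cudaStreamDestroy(stream);
        cudaFree(d_buf);
        cudaFreeHost(h_buf);
    }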
For scheduling, we use SLURM and allocate one entire node per job, no sharing.

On Thu, Jan 28, 2010 at 12:38 PM, Jon Forrest wrote:
> I'm about to spend ~$20K on a new cluster
> that will be a proof-of-concept for doing
> GPU-based computing in one of the research
> groups here.
>
> A GPU cluster is different from a traditional
> HPC cluster in several ways:
>
> 1) The CPU speed and number of cores are not
> that important because most of the computing will
> be done inside the GPU.
>
> 2) Serious GPU boards are large enough that
> they don't easily fit into standard 1U pizza
> boxes. Plus, they require more power than the
> standard power supplies in such boxes can
> provide. I'm not familiar with the boxes
> that therefore should be used in a GPU cluster.
>
> 3) Ideally, I'd like to put more than one GPU
> card in each computer node, but then I hit the
> issues in #2 even harder.
>
> 4) Assuming that a GPU can't be "time shared",
> this means that I'll have to set up my batch
> engine to treat the GPU as a non-sharable resource.
> This means that I'll only be able to run as many
> jobs on a compute node as I have GPUs. This also means
> that it would be wasteful to put CPUs in a compute
> node with more cores than the number of GPUs in the
> node. (This is assuming that the jobs don't do
> anything parallel on the CPUs - only on the GPUs).
> Even if GPUs can be time shared, given the expense
> of copying between main memory and GPU memory,
> sharing GPUs among several processes will degrade
> performance.
>
> Are there any other issues I'm leaving out?
>
> Cordially,
> --
> Jon Forrest
> Research Computing Support
> College of Chemistry
> 173 Tan Hall
> University of California Berkeley
> Berkeley, CA
> 94720-1460
> 510-643-1032
> jlforrest at berkeley.edu
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
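For reference, a hypothetical submission script in the spirit of the one-job-per-node policy above. Nothing here is from the actual site configuration, and GPU-aware scheduling (GRES) came to SLURM later; in this era the node itself is the schedulable unit, so plain exclusive allocation does the job. The binary and input names are placeholders:

    #!/bin/sh
    #SBATCH --job-name=gpu-md
    #SBATCH --nodes=1
    #SBATCH --exclusive    # whole node: all GPUs and cores belong to this job
    srun ./md_gpu input.dat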
From eagles051387 at gmail.com Fri Jan 29 23:38:17 2010
From: eagles051387 at gmail.com (Jonathan Aquilina)
Date: Sat, 30 Jan 2010 08:38:17 +0100
Subject: [Beowulf] clustering using xen virtualized machines
In-Reply-To: <0517518E-4DD3-4F0C-A75F-81A42206F0F4@pittman.co.uk>
References: <4B5F4466.9020403@hope.edu> <0517518E-4DD3-4F0C-A75F-81A42206F0F4@pittman.co.uk>
Message-ID: 

Then why not just run VMs on the host? Also, in that case, would it be possible to point PXE at it and tell it, when booting the nodes, which image to use?

On Fri, Jan 29, 2010 at 10:55 PM, Ashley Pittman wrote:
>
> On 26 Jan 2010, at 19:37, Paul Van Allsburg wrote:
> > Ashley Pittman wrote:
> >> On 25 Jan 2010, at 15:28, Jonathan Aquilina wrote:
> >>> Has anyone tried clustering using Xen-based VMs? What is everyone's
> take on that? It's something that popped into my head while in my lectures
> today.
> >>
> >> I've been using Amazon EC2 for clustering for months now; from a
> software perspective it's very similar to running real hardware. For my
> needs (development) it's perfectly adequate, though I've not benchmarked it
> against running the same code on the raw hardware.
> >
> > I'd love to try clustering on Amazon.
>
> It's really easy.
>
> > Is there a good writeup somewhere on how to configure & use MPI in the
> cloud?
>
> I'm not sure one is needed. As a bit of background, I develop and support
> an open source debugging tool for parallel applications (see my sig for
> details); as such I run a lot of parallel apps, but I run them purely to have
> something to test padb against, hence I'm not bothered about performance - I
> just need a running job to interrogate. What is important for me (or rather
> my tool) is that it works in different environments, so I run with a variety
> of clustering software.
>
> With Amazon I can boot any number of machine "instances" and pay $0.085/hour
> for each one; typically I run four at a time, but I've run with up to twenty.
> Once the instances are booted there is no difference between using them and
> using real machines. I regularly use Slurm, OpenMPI (ORTE and under Slurm),
> and MPICH2 (mpd, hydra and under Slurm), and I've yet to find any way in which
> the setup differs from running on real metal. For persistent storage I pay
> for an 'EBS' volume which I attach to one VM and NFS-export to the others,
> which use it as a shared /home; each instance also comes with a large scratch
> partition, but I typically don't use this at all. I have a bunch of scripts
> for populating the hosts files and adding user accounts, and that's all there
> is to it. For the EBS volume you simply pick the size you need, create the
> volume, attach it to a VM and then mkfs.ext3 as normal; this volume is
> persistent and is charged per GB per calendar month rather than per instance
> hour.
>
> I can also choose what distro and indeed OS to run; the default is FC8 but
> it's easy enough to pick something else. I tend to flip between FC8, Debian
> and Solaris every few weeks, mostly to ensure my code is well tested on
> different machines - it does mean re-compiling everything each time I
> switch, which can take a while.
>
> I also noticed that over-committing the CPUs on virtual machines doesn't have
> the same negative impact as over-committing the CPUs on real machines; sure,
> the application performance plummets in either case, but the virtual machine
> is still usable, whereas a real machine can stop responding almost completely.
> This means I can over-commit my VMs by running 32 procs per node and run
> 512-process jobs at a cost of only $1.36 an hour. Cheap enough to be able
> to try something, see if it works, and not have to worry about the cost.
>
> In short, Amazon makes a really good development or test system for small-
> scale clusters; it's good for testing code correctness and experimenting
> with different distros. I'm not convinced about the performance and I'm not
> convinced about the cost-effectiveness for larger or longer-running
> applications, but as a place to start it's ideal.
>
> Ashley,
>
> --
>
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf

--
Jonathan Aquilina
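On the PXE question above: a hypothetical pxelinux.cfg stanza for network-booting nodes into a Xen dom0 image. All paths and image names are invented; syslinux's mboot.c32 module is what chain-loads the hypervisor, kernel and initrd together:

    # /tftpboot/pxelinux.cfg/default -- everything here is illustrative
    DEFAULT xen-node
    LABEL xen-node
      KERNEL mboot.c32
      APPEND xen.gz dom0_mem=512M --- vmlinuz-xen console=tty0 --- initrd-xen.img

pxelinux also looks for per-node files named after the MAC address (pxelinux.cfg/01-aa-bb-cc-dd-ee-ff), which is the usual way to hand different nodes different images.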
From michf at post.tau.ac.il Sat Jan 30 04:31:45 2010
From: michf at post.tau.ac.il (Micha Feigin)
Date: Sat, 30 Jan 2010 14:31:45 +0200
Subject: [Beowulf] GPU Beowulf Clusters
In-Reply-To: <4B61CB86.9060105@berkeley.edu>
References: <4B61CB86.9060105@berkeley.edu>
Message-ID: <20100130143145.43de588a@vivalunalitshi.luna.local>

On Thu, 28 Jan 2010 09:38:14 -0800
Jon Forrest wrote:

> I'm about to spend ~$20K on a new cluster
> that will be a proof-of-concept for doing
> GPU-based computing in one of the research
> groups here.
>
> A GPU cluster is different from a traditional
> HPC cluster in several ways:
>
> 1) The CPU speed and number of cores are not
> that important because most of the computing will
> be done inside the GPU.
>

The speed not so much, but the number of cores does matter. You should have at least one core per GPU, as the CPU is in charge of scheduling and initiating memory transfers (and, if not set up for DMA, also handling the memory transfer itself). When latency is an issue (especially for jobs with a lot of CPU-side scheduling), the CPU polls the GPU for results, which can bump CPU usage. Nehalem raises another issue: there is no separate northbridge bus, and memory access goes via the CPU.

It is recommended, BTW, that you have at least the same amount of system memory as GPU memory, so with Tesla it is 4GB per GPU.

> 2) Serious GPU boards are large enough that
> they don't easily fit into standard 1U pizza
> boxes. Plus, they require more power than the
> standard power supplies in such boxes can
> provide. I'm not familiar with the boxes
> that therefore should be used in a GPU cluster.
>

You use dedicated systems. Either one 1U pizza box for the CPU and a matched 1U Tesla S1070 pizza box, which has 4 Tesla GPUs:
http://www.nvidia.com/object/product_tesla_s1070_us.html
or there are several vendors out there that match two Tesla GPUs (usually the Tesla M1060 in this case, which is a passively cooled version of the C1060)
http://www.nvidia.com/object/product_tesla_m1060_us.html
to a dual-CPU Xeon in a 1U system. You can start here (the links page from NVidia):
http://www.nvidia.com/object/tesla_preconfigured_clusters_wtb.html

There are other specialized options if you want, but most of them are aimed at higher-budget clusters.

Power-wise, each Tesla takes 160W; adding what the CPU and the rest of the system require, a 1000W power supply should do. The S1070 comes with a 1200W power supply on board.

> 3) Ideally, I'd like to put more than one GPU
> card in each computer node, but then I hit the
> issues in #2 even harder.
>

You are looking for the Tesla S1070 or the previously mentioned solutions.

> 4) Assuming that a GPU can't be "time shared",
> this means that I'll have to set up my batch
> engine to treat the GPU as a non-sharable resource.
> This means that I'll only be able to run as many
> jobs on a compute node as I have GPUs. This also means
> that it would be wasteful to put CPUs in a compute
> node with more cores than the number of GPUs in the
> node. (This is assuming that the jobs don't do
> anything parallel on the CPUs - only on the GPUs).
> Even if GPUs can be time shared, given the expense
> of copying between main memory and GPU memory,
> sharing GPUs among several processes will degrade
> performance.
>

The GPU doesn't have a swap-in/swap-out mechanism, so the only way it can time-share is by alternating kernels, as long as there is enough memory. This shouldn't be done for HPC (same as with the CPU, by the way, due to NUMA/L2-cache and context-switching issues).

What you would want to do is set up the cards in compute-exclusive mode and then tell the users not to choose a card explicitly. The context-creation function will then choose the next available card automatically. With the Tesla S1070 you would then set up the machine as having 4 schedulable slots, one per GPU. The processes will be sharing the PCI bus for communications, though, so you may prefer to set up the system as 1 job per machine, or at least use a round-robin scheduler.
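A sketch of the user side of this, assuming the admin has put the cards in compute-exclusive mode with nvidia-smi. The point above is that with no explicit cudaSetDevice the runtime can land on a free card by itself; a paranoid variant probes each device in order, since context creation on a busy exclusive-mode card simply fails (behavior as I understand the CUDA 2.x runtime; the helper name is invented):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int grab_free_gpu(void)
    {
        int n = 0;
        cudaGetDeviceCount(&n);
        for (int dev = 0; dev < n; dev++) {
            cudaSetDevice(dev);
            if (cudaFree(0) == cudaSuccess) {  /* forces context creation */
                printf("running on GPU %d\n", dev);
                return dev;
            }
            cudaThreadExit();  /* clear any half-created context (CUDA 2.x API) */
        }
        return -1;  /* all GPUs busy */
    }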
> Are there any other issues I'm leaving out?
>

Take note that the S1070 is ~$6K, so you are talking about at most two to three machines here with your budget. Also, don't even think about putting that S1070 anywhere but a server room, or at least nowhere with users nearby, as it makes a lot of noise.

> Cordially,

From gerry.creager at tamu.edu Sat Jan 30 08:46:50 2010
From: gerry.creager at tamu.edu (Gerry Creager)
Date: Sat, 30 Jan 2010 10:46:50 -0600
Subject: [Beowulf] clustering using xen virtualized machines
In-Reply-To: 
References: <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com> <4B5F18BB.8000406@sas.upenn.edu> <68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com> <4B5F82B1.10805@sonsorol.org> <68A57CCFD4005646957BD2D18E60667B0F12AF1B@milexchmb1.mil.tagmclarengroup.com> <4B605959.2000607@sas.upenn.edu> 
Message-ID: <4B64627A.4010409@tamu.edu>

Mark Hahn wrote:
>> I don't buy the argument that the winning case is packaging up a VM with
>> all your software. If you really are unable to build the required
>> software stack for a given cluster and its OS, I think using something
>
> you're right, but only for narrow-function clusters. suppose you have a
> cluster used by 2k users across a handful of different universities
> and 100 departments. and have, let's say, 2 staff. it's conceivable
> that using VMs would permit a higher level of service by putting more
> configuration flexibility into the hands of the users. yes, most would
> use a standard image (which might be the bare-metal one, actually),
> but making it easier to accommodate variance is valuable.
>
> it even offers the ability to shift the model - instead of actually
> booting VMs on nodes for a job, how about just resurrecting a number
> of VM instances (freeze-dried in already-booted state)? that makes the
> setup latency potentially much lower. (pages from a VM image can
> be fetched lazily afaik, and presumably also COW.)
>
> for the few HPC-oriented performance studies of VMs I've seen,
> the only slowdowns were for OS activity (IO, page allocation, etc).
> an ideally-behaved HPC app minimizes those already, so...

Coming in a bit late, but I have one minor quibble, Mark: VM network latency. I've seen this be a bottleneck for some of our (non-HPC) VMs on decent hardware and network gear, at least with Xen, and VMWare before it.

The attractive part of the VM picture is what you stated, though, where the user takes the onus of managing their own image with software. I'm considering this for an urgent-computing model I want to field...
gerry

From jlforrest at berkeley.edu Sat Jan 30 10:24:09 2010
From: jlforrest at berkeley.edu (Jon Forrest)
Date: Sat, 30 Jan 2010 10:24:09 -0800
Subject: [Beowulf] GPU Beowulf Clusters
In-Reply-To: <20100130143145.43de588a@vivalunalitshi.luna.local>
References: <4B61CB86.9060105@berkeley.edu> <20100130143145.43de588a@vivalunalitshi.luna.local>
Message-ID: <4B647949.9030700@berkeley.edu>

On 1/30/2010 4:31 AM, Micha Feigin wrote:

> It is recommended, BTW, that you have at least the same amount of system memory
> as GPU memory, so with Tesla it is 4GB per GPU.

I'm not going to get Teslas, for several reasons:

1) This is a proof-of-concept cluster. Spending $1200 per graphics card means that the GPUs alone, assuming 2 GPUs, would cost as much as a whole node with 2 consumer-grade cards. (See below)

2) We know that the Fermi cards are coming out soon. If we were going to spend big bucks on GPUs, we'd wait for them. But our funding runs out before the Fermis will be available. This is too bad, but there's nothing I can do about it.

See below for comments regarding CPUs and cores.

> You use dedicated systems. Either one 1U pizza box for the CPU and a matched 1U
> Tesla S1070 pizza box, which has 4 Tesla GPUs

Since my first post I've learned about the Supermicro boxes that have space for two GPUs (http://www.supermicro.com/products/system/1U/6016/SYS-6016GT-TF.cfm?GPU=). This looks like a good way to go for a proof-of-concept cluster. Plus, since we have to pay $10/U/month at the Data Center, it's a good way to use space.

The GPU that looks the most promising is the GeForce GTX275 (http://www.evga.com/products/moreInfo.asp?pn=017-P3-1175-AR). It has 1792MB of RAM and is only ~$300. I realize that there are better cards, but for this proof-of-concept cluster we want to get the best bang for the buck. Later, after we've ported our programs and have some experience optimizing them, we'll consider something better, probably using whatever the best Fermi-based card is.

The research group that will be purchasing this cluster does molecular dynamics simulations that usually take 24 hours or more to complete using quad-core Xeons. We hope to bring down this time substantially.

> The GPU doesn't have a swap-in/swap-out mechanism, so the only way it can time-share is
> by alternating kernels, as long as there is enough memory. This shouldn't be done for
> HPC (same as with the CPU, by the way, due to NUMA/L2-cache and context-switching
> issues).

Right. So this means 4 cores should be good enough for 2 GPUs. I wish somebody made a motherboard that would allow 6-core AMD Istanbuls, but they don't. Putting two 4-core CPUs on the motherboard might not be worth the cost. I'm not sure.

> The processes will be sharing the PCI bus for communications, though, so you may
> prefer to set up the system as 1 job per machine, or at least use a round-robin
> scheduler.

This is another reason not to go crazy with lots of cores. They'll be sitting idle most of the time, unless I also create queues for normal non-GPU jobs.

> Take note that the S1070 is ~$6K, so you are talking about at most two to three
> machines here with your budget.

Ha, ha!! ~$6K should get me two compute nodes, complete with graphics cards.

I appreciate everyone's comments, and I welcome more.
Cordially,
--
Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA
94720-1460
510-643-1032
jlforrest at berkeley.edu

From jlforrest at berkeley.edu Sat Jan 30 17:30:31 2010
From: jlforrest at berkeley.edu (Jon Forrest)
Date: Sat, 30 Jan 2010 17:30:31 -0800
Subject: [Beowulf] GPU Beowulf Clusters
In-Reply-To: <4B64B83B.9080303@pathscale.com>
References: <4B61CB86.9060105@berkeley.edu> <20100130143145.43de588a@vivalunalitshi.luna.local> <4B647949.9030700@berkeley.edu> <4B64B83B.9080303@pathscale.com>
Message-ID: <4B64DD37.50201@berkeley.edu>

On 1/30/2010 2:52 PM, "C. Bergström" wrote:

> Hi Jon,
>
> I must emphasize what David Mathog said about the importance of the gpu
> programming model.

I don't doubt this at all. Fortunately, we have lots of very smart people here at UC Berkeley. I have the utmost confidence that they will figure this stuff out. My job is to purchase and configure the cluster.

> My perspective (with hopefully not too much opinion added)
> OpenCL vs CUDA - OpenCL is 1/10th as popular, lacks features, is more
> tedious to write, and in an effort to stay generic loses the potential to
> fully exploit the gpu. At one point the performance of the drivers from
> Nvidia was not equivalent, but I think that's been fixed. (This does not
> mean all vendors are unilaterally doing a good job)

This is very interesting news. As far as I know, nobody is doing anything with OpenCL in the College of Chemistry around here. On the other hand, we've been following all the press about how it's going to be the great unifier, so that it won't be necessary to use a proprietary API such as CUDA anymore. At this point it's too early to do anything with OpenCL until our colleagues in the Computer Science department have made a pass at it and have experiences to talk about.

> Have you considered sharing access with another research lab that has
> already purchased something similar?
> (Some vendors may also be willing to let you run your codes in exchange
> for feedback.)

There's nobody else at UC Berkeley I know of who has a GPU cluster.

I don't know of any vendor who'd be willing to volunteer their cluster. If anybody would like to volunteer, step right up.

> 1) sw thread synchronization chews up processor time

Right, but let's say right now 80% of the CPU time is spent in routines that will eventually be done in the GPU (I'm just making this number up). I don't see how having a faster CPU would help overall. (Amdahl's law, in reverse: once 80% of the time is on the GPU, a faster CPU can only ever shave the remaining 20%, i.e. less than a 1.25x overall gain.)

> 2) Do you already know if your code has enough computational complexity
> to outweigh the memory access costs?

In general, yes. A couple of grad students have ported some of their code to CUDA with excellent results. Plus, molecular dynamics is well suited to GPU programming, or so I'm told. Several of the popular open source MD packages have already been ported, also with excellent results.

> 3) Do you know if the GTX275 has enough vram? Your benchmarks will
> suffer if you start going to GART and page faulting

The one I mentioned in my posting has 1.8GB of RAM. If this isn't enough then we're in trouble. The grad student I mentioned has been using the 898MB version of this card without problems.

> 4) I can tell you 100% that not all gpu are created equally when it
> comes to handling cuda code. I don't have experience with the GTX275,
> but if you do hit issues I would be curious to hear about them.

I've heard that it's much better than the 9500GT that we first started using. Since the 9500GT is a much cheaper card we didn't expect much performance out of it, but the grad student who was trying to use it said that there were problems with it not releasing memory, resulting in having to reboot the host. I don't know the details.

> Some questions in return..
> Is your code currently C, C++ or Fortran?

The most important program for this group is in Fortran. We're going to keep it in Fortran, but we're going to write C interfaces to the routines that will run on the GPU, and then write these routines in C.
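Roughly, the plumbing would look like the skeleton below - all names are invented for illustration, and it assumes the common g77/gfortran conventions (arguments passed by reference, a trailing underscore on external names); the .cu file is compiled with nvcc and linked into the Fortran program:

    // force_gpu.cu -- hypothetical C wrapper callable from Fortran
    #include <cuda_runtime.h>

    __global__ void force_kernel(const float *x, float *f, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            f[i] = -x[i];   /* stand-in for the real force expression */
    }

    extern "C" void force_gpu_(const float *x, float *f, const int *n)
    {
        size_t bytes = *n * sizeof(float);
        float *d_x, *d_f;
        cudaMalloc((void **)&d_x, bytes);
        cudaMalloc((void **)&d_f, bytes);
        cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
        force_kernel<<<(*n + 255) / 256, 256>>>(d_x, d_f, *n);
        cudaMemcpy(f, d_f, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d_x);
        cudaFree(d_f);
    }

From Fortran this would be called as "call force_gpu(x, f, n)".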
> Is there any interest in optimizations at the compiler level which could
> benefit molecular dynamics simulations?

Of course, but at what price? I'm talking about both the price in dollars and the price in non-standard directives.

I'm not a chemist, so I don't know what would speed up MD calculations more than a good GPU.

Cordially,
--
Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA
94720-1460
510-643-1032
jlforrest at berkeley.edu

From landman at scalableinformatics.com Sat Jan 30 21:38:24 2010
From: landman at scalableinformatics.com (Joe Landman)
Date: Sun, 31 Jan 2010 00:38:24 -0500
Subject: [Beowulf] Anyone with really large clusters seeing memory leaks with OFED 1.5 for tcp based apps?
Message-ID: <4B651750.8080307@scalableinformatics.com>

Hi folks

Trying to trace something annoying down, and see if we are running into something that is known.

OFED 1.5 on a 2.6.30.10 kernel. Running a file system atop IPoIB (many reasons, none I care to get into here at the moment). Under light load, the file system gradually grabs memory. Possibly a leak, not entirely sure. Could be the OFED stack underneath.

The backing file system is xfs. It has been (on this hardware in other situations) rock solid stable. Here, xfs and OFED/IPoIB all toss their cookies (and fail allocations) under moderate to heavy load. Working with the file system vendor on this. I am not sure we have the answer nailed, so I wanted to see who out there is running a big ( >512 nodes) cluster, doing large data transfers (preferably over IPoIB) for data storage, and running a late-model OFED.

If you fall into this category, please let me know, as I'd like to ask a few questions offline about any observed OFED/IPoIB failure modes. I am not convinced it is OFED/IPoIB, but I'd like to see what other people have run into ... if anything.

Thanks!

Joe

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
       http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

From michf at post.tau.ac.il Sun Jan 31 09:06:48 2010
From: michf at post.tau.ac.il (Micha Feigin)
Date: Sun, 31 Jan 2010 19:06:48 +0200
Subject: [Beowulf] GPU Beowulf Clusters
In-Reply-To: <4B647949.9030700@berkeley.edu>
References: <4B61CB86.9060105@berkeley.edu> <20100130143145.43de588a@vivalunalitshi.luna.local> <4B647949.9030700@berkeley.edu>
Message-ID: <20100131190648.6d91b2f7@vivalunalitshi.luna.local>

On Sat, 30 Jan 2010 10:24:09 -0800
Jon Forrest wrote:

> On 1/30/2010 4:31 AM, Micha Feigin wrote:
>
> > It is recommended, BTW, that you have at least the same amount of system memory
> > as GPU memory, so with Tesla it is 4GB per GPU.
>
> I'm not going to get Teslas, for several reasons:
>
> 1) This is a proof-of-concept cluster.
> Spending $1200
> per graphics card means that the GPUs alone, assuming
> 2 GPUs, would cost as much as a whole node with
> 2 consumer-grade cards. (See below)
>

Be very, very sure that consumer GeForces can go in 1U boxes. It's not so much the space as that I'm skeptical of their ability to handle the thermal issues. They are just not designed for this kind of work.

Note that GeForces are overclocked (my GTX 285 by 30% compared to a Tesla with the same chip) and are actively cooled, which means that you need to get air flowing into the side fan. That's exactly why they put the Tesla M and not the C into those boxes.

The GeForce driver also throttles the card under load to solve thermal issues. You will probably want to underclock the cards to the Tesla spec and be sure to monitor the thermal state.

I know someone who works with 3 GTX295s in a desktop box, and he initially had some thermal shutdown issues with older drivers. I'm guessing that the newer drivers just throttle the cards more aggressively under load.

> 2) We know that the Fermi cards are coming out
> soon. If we were going to spend big bucks
> on GPUs, we'd wait for them. But our funding
> runs out before the Fermis will be available.
> This is too bad, but there's nothing I can do
> about it.
>

Check out the Mad Science program; it's supposed to end today, but maybe if you talk to NVidia they can still get you into it (they are rather flexible, especially with universities, and they also offer it for companies):
http://www.nvidia.com/object/mad_science_promo.html
You can buy a current Tesla (T10 core) and upgrade it to a Fermi (T20 core) when it comes out, for the cost difference. It may be more cost-effective if you plan to build a Fermi cluster later on. It is designed to upgrade within the same line, though (C, M or S), so you may want to consider now which one to go with.

> See below for comments regarding CPUs and cores.
>
> > You use dedicated systems. Either one 1U pizza box for the CPU and a matched 1U
> > Tesla S1070 pizza box, which has 4 Tesla GPUs
>
> Since my first post I've learned about the Supermicro boxes
> that have space for two GPUs
> (http://www.supermicro.com/products/system/1U/6016/SYS-6016GT-TF.cfm?GPU=).
> This looks like a good way to go for a proof-of-concept cluster. Plus,
> since we have to pay $10/U/month at the Data Center, it's a good
> way to use space.
>

See my previous comment.

> The GPU that looks the most promising is the GeForce GTX275
> (http://www.evga.com/products/moreInfo.asp?pn=017-P3-1175-AR).
> It has 1792MB of RAM and is only ~$300. I realize that there
> are better cards, but for this proof-of-concept cluster we
> want to get the best bang for the buck. Later, after we've
> ported our programs and have some experience optimizing them,
> we'll consider something better, probably using whatever
> the best Fermi-based card is.
>
> The research group that will be purchasing this cluster does
> molecular dynamics simulations that usually take 24 hours or more
> to complete using quad-core Xeons. We hope to bring down this
> time substantially.
>
> > The GPU doesn't have a swap-in/swap-out mechanism, so the only way it can time-share is
> > by alternating kernels, as long as there is enough memory. This shouldn't be done for
> > HPC (same as with the CPU, by the way, due to NUMA/L2-cache and context-switching
> > issues).
>
> Right. So this means 4 cores should be good enough for 2 GPUs.
> I wish somebody made a motherboard that would allow 6-core
> AMD Istanbuls, but they don't.
> Putting two 4-core CPUs on the
> motherboard might not be worth the cost. I'm not sure.
>
> > The processes will be sharing the PCI bus for communications, though, so you may
> > prefer to set up the system as 1 job per machine, or at least use a round-robin
> > scheduler.
>
> This is another reason not to go crazy with lots of cores.
> They'll be sitting idle most of the time, unless I also
> create queues for normal non-GPU jobs.
>
> > Take note that the S1070 is ~$6K, so you are talking about at most two to three
> > machines here with your budget.
>
> Ha, ha!! ~$6K should get me two compute nodes, complete
> with graphics cards.
>
> I appreciate everyone's comments, and I welcome more.
>
> Cordially,

From michf at post.tau.ac.il Sun Jan 31 09:33:58 2010
From: michf at post.tau.ac.il (Micha Feigin)
Date: Sun, 31 Jan 2010 19:33:58 +0200
Subject: [Beowulf] GPU Beowulf Clusters
In-Reply-To: <4B64DD37.50201@berkeley.edu>
References: <4B61CB86.9060105@berkeley.edu> <20100130143145.43de588a@vivalunalitshi.luna.local> <4B647949.9030700@berkeley.edu> <4B64B83B.9080303@pathscale.com> <4B64DD37.50201@berkeley.edu>
Message-ID: <20100131193358.395acc13@vivalunalitshi.luna.local>

On Sat, 30 Jan 2010 17:30:31 -0800
Jon Forrest wrote:

> On 1/30/2010 2:52 PM, "C. Bergström" wrote:
>
> > Hi Jon,
> >
> > I must emphasize what David Mathog said about the importance of the gpu
> > programming model.
>
> I don't doubt this at all. Fortunately, we have lots
> of very smart people here at UC Berkeley. I have
> the utmost confidence that they will figure this
> stuff out. My job is to purchase and configure the
> cluster.
>
> > My perspective (with hopefully not too much opinion added)
> > OpenCL vs CUDA - OpenCL is 1/10th as popular, lacks features, is more
> > tedious to write, and in an effort to stay generic loses the potential to
> > fully exploit the gpu. At one point the performance of the drivers from
> > Nvidia was not equivalent, but I think that's been fixed. (This does not
> > mean all vendors are unilaterally doing a good job)
>
> This is very interesting news. As far as I know, nobody is doing
> anything with OpenCL in the College of Chemistry around here.
> On the other hand, we've been following all the press about how
> it's going to be the great unifier, so that it won't be necessary
> to use a proprietary API such as CUDA anymore. At this point it's too
> early to do anything with OpenCL until our colleagues in
> the Computer Science department have made a pass at it and
> have experiences to talk about.
>

People are starting to work with OpenCL, but I don't think that it's ready yet. The NVidia implementation is still buggy and not up to par against Cuda in terms of performance. Code is longer and more tedious (it mostly matches the NVidia driver model instead of the much easier-to-use C API). I know that although NVidia say they fully support it, they don't like it too much. NVidia techs told me that the performance difference can be about 1:2.

Cuda has existed for 5 years (and another 2 internally at NVidia). Version 1 of OpenCL was released December 2008, and they started working on 1.1 immediately after that. It has also been broken almost from the start due to too many companies controlling it (it's designed by a consortium) and trying to solve the problem for too many scenarios at the same time.

ATI also started supporting OpenCL, but I don't have any experience with that. Their upside is that it also allows compiling CPU versions.
I would start with Cuda, as the move to OpenCL is very simple afterwards if you wish, and Cuda is easier to start with.

Also note that OpenCL gives you functional portability but not performance portability. You will not write the same OpenCL code for NVidia, ATI, CPUs, etc. The vectorization should be all different (NVidia discourages vectorization, ATI requires vectorization, SSE requires different vectorization), the memory model is different, the size of the work groups should be different, etc.

> > Have you considered sharing access with another research lab that has
> > already purchased something similar?
> > (Some vendors may also be willing to let you run your codes in exchange
> > for feedback.)
>
> There's nobody else at UC Berkeley I know of who has a GPU
> cluster.
>
> I don't know of any vendor who'd be willing to volunteer
> their cluster. If anybody would like to volunteer, step
> right up.
>

Are you aware of the NVidia professor partnership program? We got a Tesla S1070 for free from them.
http://www.nvidia.com/page/professor_partnership.html

> > 1) sw thread synchronization chews up processor time
>
> Right, but let's say right now 80% of the CPU time is spent
> in routines that will eventually be done in the GPU (I'm
> just making this number up). I don't see how having a faster
> CPU would help overall.
>

My experience is that unless you wish to write hybrid code (code that partly runs on the GPU and partly on the CPU, in parallel, to fully utilize the system), you don't need to care too much about the CPU power. Note that the Cuda model is asynchronous, so you can run code in parallel between the GPU and CPU.

> > 2) Do you already know if your code has enough computational complexity
> > to outweigh the memory access costs?
>
> In general, yes. A couple of grad students have ported some
> of their code to CUDA with excellent results. Plus, molecular
> dynamics is well suited to GPU programming, or so I'm told.
> Several of the popular open source MD packages have already
> been ported, also with excellent results.
>

The issue is not only computational complexity but also regular memory accesses. Random memory accesses on the GPU can seriously kill your performance.

Also note that until Fermi comes out, the double-precision performance is horrible. If you can't use single precision, then GPUs are probably not for you at the moment. Double precision on the G200 is around 1/8 of single precision, and the G80/G90 don't have double precision at all. Fermi improves that by finally providing double precision running at 1/2 the single-precision speed (basically combining two FPUs into one double-precision unit).

> > 3) Do you know if the GTX275 has enough vram? Your benchmarks will
> > suffer if you start going to GART and page faulting
>

You don't have page faulting on the GPU; GPUs don't have virtual memory. If you don't have enough memory, the allocation will just fail.

> The one I mentioned in my posting has 1.8GB of RAM. If this isn't
> enough then we're in trouble. The grad student I mentioned
> has been using the 898MB version of this card without problems.
>
> > 4) I can tell you 100% that not all gpu are created equally when it
> > comes to handling cuda code. I don't have experience with the GTX275,
> > but if you do hit issues I would be curious to hear about them.
>
> I've heard that it's much better than the 9500GT that we first
> started using.
> Since the 9500GT is a much cheaper card we didn't expect
> much performance out of it, but the grad student who was trying
> to use it said that there were problems with it not releasing memory,
> resulting in having to reboot the host. I don't know the details.
>

I don't have any issues with releasing memory. The big differences are between the G80/G90 series (including the 9500GT), which is the 1.1 Cuda model, and the G200, which uses the 1.3 Cuda model. Memory handling is much better on the 1.3 GPUs (the memory access patterns needed to fully utilize the memory bandwidth are much more lenient). The G200 also has double-precision support (although at about 1/8 the speed of single precision). There is also more support for atomic operations and a few other differences, although the biggest difference is the memory bandwidth utilization.

Don't bother with the 8000 and 9000 series for HPC and Cuda. Cheaper for learning, but not so much for deployment.

> > Some questions in return..
> > Is your code currently C, C++ or Fortran?
>
> The most important program for this group is in Fortran.
> We're going to keep it in Fortran, but we're going to
> write C interfaces to the routines that will run on
> the GPU, and then write these routines in C.
>

You may want to look into the PGI compiler. They introduced Cuda support for Fortran, I believe since November.
http://www.pgroup.com/resources/cudafortran.htm

> > Is there any interest in optimizations at the compiler level which could
> > benefit molecular dynamics simulations?
>
> Of course, but at what price? I'm talking about
> both the price in dollars and the price in non-standard
> directives.
>
> I'm not a chemist, so I don't know what would speed up MD calculations
> more than a good GPU.
>

On the CPU side you can utilize SSE. You can also use single precision on the CPU, along with SSE and good cache utilization, to speed things up greatly on the CPU as well. My personal experience, though, is that it's much harder to use such optimization on the CPU than on the GPU for most problems.

> Cordially,
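A tiny illustration of the kind of SSE use meant above - single-precision, four-wide adds via compiler intrinsics. The function and array names are invented, and it assumes 16-byte-aligned arrays (e.g. from _mm_malloc) with a length divisible by 4:

    #include <xmmintrin.h>

    /* c = a + b over n floats, 4 at a time */
    void vadd_sse(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(a + i);   /* aligned 4-float load */
            __m128 vb = _mm_load_ps(b + i);
            _mm_store_ps(c + i, _mm_add_ps(va, vb));
        }
    }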
From michf at post.tau.ac.il Sun Jan 31 11:17:41 2010
From: michf at post.tau.ac.il (Micha Feigin)
Date: Sun, 31 Jan 2010 21:17:41 +0200
Subject: [Beowulf] GPU Beowulf Clusters
In-Reply-To: <4B65C8B0.9060300@pathscale.com>
References: <4B61CB86.9060105@berkeley.edu> <20100130143145.43de588a@vivalunalitshi.luna.local> <4B647949.9030700@berkeley.edu> <4B64B83B.9080303@pathscale.com> <4B64DD37.50201@berkeley.edu> <20100131193358.395acc13@vivalunalitshi.luna.local> <4B65C8B0.9060300@pathscale.com>
Message-ID: <20100131211741.13793738@vivalunalitshi.luna.local>

On Sun, 31 Jan 2010 21:15:12 +0300
"C. Bergström" wrote:

> Micha Feigin wrote:
> > On Sat, 30 Jan 2010 17:30:31 -0800
> > Jon Forrest wrote:
> > [snip]
> >
> > People are starting to work with OpenCL, but I don't think that it's ready yet.
> > The NVidia implementation is still buggy and not up to par against Cuda in
> > terms of performance.
>
> That used to be true, but I thought they fixed that? (How old is your
> information)

From Thursday... (three days or so). Not personal though; I prefer Cuda. I've got a friend who's working with Prof. Amnon Barak of the Hebrew University of Jerusalem, who created MOSIX, to do something similar for the GPU, and they are doing it with OpenCL. One example: you can pass NULL as the workgroup size, and the system should set an optimal workgroup size automatically. Turns out that NVidia sets it to 1. Anyone who knows NVidia knows how good that is ...

> > Cuda has existed for 5 years (and another 2 internally at NVidia). Version 1 of
> > OpenCL was released December 2008, and they started working on 1.1 immediately
> > after that. It has also been broken almost from the start due to too many
> > companies controlling it (it's designed by a consortium) and trying to solve the
> > problem for too many scenarios at the same time.
>
> The problem isn't too many companies.. It was IBM's cell requirements
> afaik.. Thank god that's dead now.. It's also intel vs. nvidia vs. amd
>
> > ATI also started supporting OpenCL, but I don't have any experience with that.
> > Their upside is that it also allows compiling CPU versions.
> >
> > I would start with Cuda, as the move to OpenCL is very simple afterwards if you
> > wish, and Cuda is easier to start with.
>
> I would start with a directive-based approach that's entirely more sane
> than CUDA or OpenCL.. Especially if his code is primarily Fortran. I
> think writing C interfaces so that you can call the GPU is a maintenance
> nightmare and will not only be time-consuming, but will later make
> optimizing the application *a lot* harder. (I say this with my gpu
> compiler hat on and more than happy to go into specifics)

My experience is that it will never be as good, but I'll be happy to hear, from personal experience, by how much. I'm guessing that you are talking about HMPP or something similar here. Just moving stuff to the GPU entails very little overhead and gives you much more control of the memory and communication handling. For stuff that needs shared memory and/or textures for good performance, you usually need the direct control anyway. No experience with HMPP though; I should probably test-run it at some point. Personally I'd love to hear about specifics.

> > Also note that OpenCL gives you functional portability but not performance
> > portability. You will not write the same OpenCL code for NVidia, ATI, CPUs, etc.
> > The vectorization should be all different (NVidia discourages vectorization, ATI
> > requires vectorization, SSE requires different vectorization), the memory model
> > is different, the size of the work groups should be different, etc.
>
> Please look at HMPP and see if it may solve this..

Will do

[... snip again ...]

> > The issue is not only computational complexity but also regular memory accesses.
> > Random memory accesses on the GPU can seriously kill your performance.
>
> I think I mentioned memory accesses.. Are you talking about page faults
> or what specifically? (My perspective is skewed and I may be using a
> different term.)

No, just random memory accesses; think lookup tables. LUTs are horrible performance-wise on the GPU. If you can't get coalescing working for you, you can get a factor-of-8 (IIRC) performance hit.
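To make the coalescing point concrete, a contrived kernel pair (names invented for illustration). On 1.x-era hardware, the first pattern lets a half-warp's loads merge into a single memory transaction; the LUT-driven gather can scatter them into one transaction per thread:

    __global__ void coalesced(const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i] * 2.0f;        /* neighboring threads read neighboring
                                         addresses: coalesced */
    }

    __global__ void scattered(const float *in, const int *lut, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[lut[i]] * 2.0f;   /* table-driven gather: worst case,
                                         one transaction per thread */
    }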
[... snip once more ...]

> > You don't have page faulting on the GPU; GPUs don't have virtual memory. If you
> > don't have enough memory, the allocation will just fail.
>
> Whatever you want to label it, at a hardware level nvidia cards *do* have
> vram and the drivers *can* swap to system memory. They use two things
> to deal with this: a) a hw-based page fault mechanism and b) dma copying to
> reduce cpu overhead. If you try to allocate more than is available on
> the card, yes, it will probably just fail. (We are working on the
> drivers) My point was about what happens between the context switches
> of kernels.

I'm not aware of intentional swapping done by NVidia. I've had issues with kernels dying due to lack of memory on my laptop which could have been solved had there been swapping. And I'm talking about memory allocated from different processes, where for each process it does fit in memory.

There is an issue with Windows Vista/7 vs. XP, where Windows Vista/7 decided to manage the GPU memory as virtual memory, but again, I'm not sure about actual swapping. I need to get updated on the exact details, as I didn't test-drive Win 7 too much.

I probably should ask one of the devtechs at NVidia what is done at the driver level. Pity I didn't see this thread last week, as there were a few of them around for a visit :(

[ ...]

> > My personal experience, though, is that it's much harder to use such optimization
> > on the CPU than on the GPU for most problems.
>
> CUDA/OpenCL and friends implicitly identify which areas can be
> vectorized and then explicitly offload them. You are comparing
> apples/oranges here..

Cuda/OpenCL do it explicitly, actually. You have things like auto-vectorization by the Intel compiler, but it's very limited in recognizing vectorizable code. For anything big you need to vectorize manually. If you look at the OpenCL tutorials from ATI, they tell you that you need to use float4 if you want the CPU code to vectorize using SSE; it's not done implicitly.

From gerry.creager at tamu.edu Sun Jan 31 12:31:40 2010
From: gerry.creager at tamu.edu (Gerry Creager)
Date: Sun, 31 Jan 2010 14:31:40 -0600
Subject: [Beowulf] GPU Beowulf Clusters
In-Reply-To: <20100131190648.6d91b2f7@vivalunalitshi.luna.local>
References: <4B61CB86.9060105@berkeley.edu> <20100130143145.43de588a@vivalunalitshi.luna.local> <4B647949.9030700@berkeley.edu> <20100131190648.6d91b2f7@vivalunalitshi.luna.local>
Message-ID: <4B65E8AC.1010904@tamu.edu>

I employ a GTX 285 in a dedicated remote-access graphics box for data-local visualization and run into some of these issues, too. More inline, but Micha has it right.

Micha Feigin wrote:
> On Sat, 30 Jan 2010 10:24:09 -0800
> Jon Forrest wrote:
>
>> On 1/30/2010 4:31 AM, Micha Feigin wrote:
>>
>>> It is recommended, BTW, that you have at least the same amount of system memory
>>> as GPU memory, so with Tesla it is 4GB per GPU.
>>
>> I'm not going to get Teslas, for several reasons:
>>
>> 1) This is a proof-of-concept cluster.
Since I'm not yet interested enough to actually look at the onboard chip speeds, I don't know. However, the one I've now got is in a 4u with additional forced air in the case to support an overtemp problem we had that was primarily flow-related (extra fans in the 2u and 3u cases we tried). We've not wandered too far into GPGPU processing... our user community has not shown an interest in it, but for graphics, it's useful. > The geforce driver also throttles the card under load to solve thermal issues. I believe this depends on onboard temp monitoring. Again, sufficient airflow is your friend. > You will probably want to under clock the cards to the tesla spec and be sure > to monitor the thermal state. > > I know someone who works with 3 gtx295 in a desktop box and he initially had > some thermal shutdown issues with older drivers. I'm guessing that the newer > drivers just throttle the cards more aggressively under load. > >> 2) We know that the Fermi cards are coming out >> soon. If we were going to spend big bucks >> on GPUs, we'd wait for them. But, our funding >> runs out before the Fermis will be available. >> This is too bad but there's nothing I can do >> about it. >> > > Check out the mad scientist program, it's supposed to end today, but maybe if you talk to NVidia they can still get you into it (they are rather flexible, esspecially with universities, and they also offer if for companies) > http://www.nvidia.com/object/mad_science_promo.html > You can buy a current telsa (t10 core) and upgrade it for a fermi (t20 core) > when it comes out for the cost difference. May be more cost effective if you do > plan to build a fermi cluster later on. It is designed to upgrade to the same > line though (c, m or s) so you may want to consider now which one to go with. > >> See below for comments regarding CPUs and cores. >> >>> You use dedicated systems. Either one 1u pizza box for the CPU and a matched 1u >>> tesla s1070 pizza box which has 4 tesla GPUs >> Since my first post I've learned about the Supermicro boxes >> that have space for two GPUs >> (http://www.supermicro.com/products/system/1U/6016/SYS-6016GT-TF.cfm?GPU=) . >> This looks like a good way to go for a proof-of-concept cluster. Plus, >> since we have to pay $10/U/month at the Data Center, it's a good >> way to use space. >> > > See my previous comment > >> The GPU that looks the most promising is the GeForce GTX275. >> (http://www.evga.com/products/moreInfo.asp?pn=017-P3-1175-AR) >> It has 1792MB of RAM and is only ~$300. I realize that there >> are better cards but for this proof-of-concept cluster we >> want to get the best bang for the buck. Later, after we've >> ported our programs, and have some experience optimizing them, >> then we'll consider something better, probably using whatever >> the best Fermi-based card is. >> >> The research group that will be purchasing this cluster does >> molecular dynamics simulations that usually take 24 hours or more >> to complete using quad-core Xeons. We hope to bring down this >> time substantially. >> >>> It doesn't have a swap in/swap out mechanism, so the way it may time share is >>> by alternating kernels as long as there is enough memory. Shouldn't be done for >>> HPC (same with CPU by the way due to numa/l2 cache and context switching >>> issues). >> Right. So this means 4 cores should be good enough for 2 GPUs. >> I wish somebody made a motherboard that would allow 6-core >> AMD Istanbuls, but they don't. 
Putting 2 4-cores CPUs on the >> motherboard might not be worth the cost. I'm not sure. >> >>> The processes will be sharing the pci bus though for communications so you may >>> prefer to setup the system as 1 job per machine or at least a round robin >>> scheduler. >> This is another reason not to go crazy with lots of cores. >> They'll be sitting idle most of the time, unless I also >> create queues for normal non-GPU jobs. >> >>> Take note that the s1070 is ~6k$ so you are talking at most two to three >>> machines here with your budget. >> Ha, ha!! ~$6K should get me two compute nodes, complete >> with graphics cards. gerry From hahn at mcmaster.ca Sun Jan 31 14:06:34 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Sun, 31 Jan 2010 17:06:34 -0500 (EST) Subject: [Beowulf] GPU Beowulf Clusters In-Reply-To: <4B65E8AC.1010904@tamu.edu> References: <4B61CB86.9060105@berkeley.edu> <20100130143145.43de588a@vivalunalitshi.luna.local> <4B647949.9030700@berkeley.edu> <20100131190648.6d91b2f7@vivalunalitshi.luna.local> <4B65E8AC.1010904@tamu.edu> Message-ID: >> Be very very sure that consumer geforces can go in 1u boxes. It's not so >> much >> the space as much as I'm skeptical with their ability of handling the >> thermal >> issues. They are just not designed for this kind of work. > > I've had to go to 2u and eventually to larger boxes because of power supply > and air-flow requirements. This is a big issue. I'm a bit puzzled here. sumermicro sells servers that take either two M1060's, or two C1060's, or two of any pcie 2 x16 cpus. their airflow design seems at least thought-about, and their PSU is 1400W. C1060 specs merely say "200W max, 160W typical" - which is probably about the same as gtx275 according to wikipedia. so something like 600W expected from 1U - not really that hard, especially if you don't have a wall of 40u racks full of them... >> Note that geforces are overclocked (my gtx 285 by 30% compared to a tesla >> with >> the same chip) well, they're tuned differently: gf cards have substantially higher memory clocks and lower shader clocks. tesla has higher shader and substantially slower memory clocks (presumably because there are more loads on the bus.) >> and are actively cooled, which means that you need to get >> air >> flowing into the side fan. That's exactly why they put the tesla m and not >> the >> c into those boxes. why is this a problem with 1U? or do you really mean "double-wide cards don't provide enough clearance in 1U to get air to the card's intake"? -mark hahn From michf at post.tau.ac.il Sun Jan 31 16:31:30 2010 From: michf at post.tau.ac.il (Micha) Date: Mon, 01 Feb 2010 02:31:30 +0200 Subject: [Beowulf] GPU Beowulf Clusters In-Reply-To: References: <4B61CB86.9060105@berkeley.edu> <20100130143145.43de588a@vivalunalitshi.luna.local> <4B647949.9030700@berkeley.edu> <20100131190648.6d91b2f7@vivalunalitshi.luna.local> <4B65E8AC.1010904@tamu.edu> Message-ID: <4B6620E2.10505@post.tau.ac.il> On 01/02/2010 00:06, Mark Hahn wrote: >>> Be very very sure that consumer geforces can go in 1u boxes. It's not >>> so much >>> the space as much as I'm skeptical with their ability of handling the >>> thermal >>> issues. They are just not designed for this kind of work. >> >> I've had to go to 2u and eventually to larger boxes because of power >> supply and air-flow requirements. This is a big issue. > > I'm a bit puzzled here. sumermicro sells servers that take either two > M1060's, or two C1060's, or two of any pcie 2 x16 cpus. 
their airflow > design seems at least thought-about, and their PSU is 1400W. > The PSU is enough > C1060 specs merely say "200W max, 160W typical" - which is probably > about the same as gtx275 according to wikipedia. so something like 600W > expected from 1U - not really that hard, especially if you don't > have a wall of 40u racks full of them... > >>> Note that geforces are overclocked (my gtx 285 by 30% compared to a >>> tesla with >>> the same chip) > > well, they're tuned differently: gf cards have substantially higher memory > clocks and lower shader clocks. tesla has higher shader and substantially > slower memory clocks (presumably because there are more loads on the bus.) > yes, they're tuned differently, but because they are meant for different markets. gf are for the gamer market and are assumed to be run for several hours at a time, without too many of them in the machine fighting for airflow (gamer setup). Throttling is no big issue if needed. tesla is a server product that needs to run 24/7 for days/months without throttling (consistent output). Usually there are several of them in one machine (or shared quadro + tesla) Another issue is tolerance to memory errors. Higher temp/clock can cause more memory errors. These may cause small unnoticeable glitches for game graphics but will ruin hpc results. The two main issues taken into account for tuning is running time, and leniency to throttling. >>> and are actively cooled, which means that you need to get air >>> flowing into the side fan. That's exactly why they put the tesla m >>> and not the >>> c into those boxes. > > why is this a problem with 1U? or do you really mean "double-wide cards > don't provide enough clearance in 1U to get air to the card's intake"? > all the cards we are talking about are double wide. c1060 is actively cooled and is designed for a desktop pc. m1060 is passively cooled and designed for a 1u server. the c1060 assumes side air intake and rear exhaust. m1060 expects through flow and no external access to exhaust. Different design based on different airflow paradigms. I never built some systems so I'm not talking from experience but assumption (we are using c1060 in desktops and s1070 in servers). I'm not sure if a double wide card with side air intake in a 1u box allow any airflow to reach the air intake and thus the GPU. Maybe you can mod the card by taking the plastic off to improve airflow though. It looks from their site that they support double wide cards in their boxes so I guess that they tested the cooling. They definitely have more experience than me with such setups. I didn't say that it doesn't work, I just advised that you make sure as it sounded borderline to me and as noted previously by someone else, it has caused problem for people. 
> -mark hahn > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From gerry.creager at tamu.edu Sun Jan 31 16:32:01 2010 From: gerry.creager at tamu.edu (Gerry Creager) Date: Sun, 31 Jan 2010 18:32:01 -0600 Subject: [Beowulf] GPU Beowulf Clusters In-Reply-To: References: <4B61CB86.9060105@berkeley.edu> <20100130143145.43de588a@vivalunalitshi.luna.local> <4B647949.9030700@berkeley.edu> <20100131190648.6d91b2f7@vivalunalitshi.luna.local> <4B65E8AC.1010904@tamu.edu> Message-ID: <4B662101.6080709@tamu.edu> Mark Hahn wrote: >>> Be very very sure that consumer geforces can go in 1u boxes. It's not >>> so much >>> the space as much as I'm skeptical of their ability to handle the >>> thermal >>> issues. They are just not designed for this kind of work. >> >> I've had to go to 2u and eventually to larger boxes because of power >> supply and air-flow requirements. This is a big issue. > > I'm a bit puzzled here. Supermicro sells servers that take either two > M1060's, or two C1060's, or two of any pcie 2 x16 gpus. their airflow > design seems at least thought-about, and their PSU is 1400W. > > C1060 specs merely say "200W max, 160W typical" - which is probably > about the same as gtx275 according to wikipedia. so something like 600W > expected from 1U - not really that hard, especially if you don't > have a wall of 40u racks full of them... A little over a year ago, a 1u 600w supply was a bit difficult to find for < $400, and additional fans for one required buying a specialty 1u case. I could have driven node price over $4k with CPUs, memory, a large onboard scratch, etc. I, too, was building a proof-of-concept box at the time. Now, it's used almost daily by several folks, and I'm thinking of building a new POC to house 4x gx's... And there's still no user interest in CUDA; without that interest I don't have enough time to play, given that my own research program isn't computational science. >>> Note that geforces are overclocked (my gtx 285 by 30% compared to a >>> tesla with >>> the same chip) > > well, they're tuned differently: gf cards have substantially higher memory > clocks and lower shader clocks. tesla has higher shader and substantially > slower memory clocks (presumably because there are more loads on the bus.) Didn't realize. Thanks. >>> and are actively cooled, which means that you need to get air >>> flowing into the side fan. That's exactly why they put the tesla m >>> and not the >>> c into those boxes. > > why is this a problem with 1U? or do you really mean "double-wide cards > don't provide enough clearance in 1U to get air to the card's intake"? That's what *I* found, anyway. Yeah, what you said. Sorry, I thought that was obvious. When you turn one of those cards on its side, you do have trouble with card-width clearance. gerry From skylar at cs.earlham.edu Sun Jan 31 17:17:27 2010 From: skylar at cs.earlham.edu (Skylar Thompson) Date: Sun, 31 Jan 2010 17:17:27 -0800 Subject: [Beowulf] Anyone with really large clusters seeing memory leaks with OFED 1.5 for tcp based apps? In-Reply-To: <4B651750.8080307@scalableinformatics.com> References: <4B651750.8080307@scalableinformatics.com> Message-ID: <4B662BA7.7020206@cs.earlham.edu> Joe Landman wrote: > Hi folks > > Trying to trace something annoying down, and see if we are running > into something that is known. > > OFED 1.5 on a 2.6.30.10 kernel.
Running a file system atop IPoIB > (many reasons, none I care to get into here at the moment). Under > light load, the file system gradually grabs memory. Possibly a leak, > not entirely sure. Could be the OFED stack underneath. Backing file > system is xfs. That has been (on this hardware in other > situations) rock solid stable. Here, xfs, OFED/IPoIB all toss their > cookies (and fail allocations) under moderate to heavy load. > > Working with the file system vendor on this. I am not sure we have > the answer nailed, so I wanted to see who out there is running a big ( > >512 nodes) cluster, doing large data transfers (preferably over > IPoIB), for data storage, and running a late model OFED. If you fall > into this category, please let me know, as I'd like to ask a few > questions offline about any observed OFED/IPoIB failure modes. I am > not convinced it is OFED/IPoIB, but I'd like to see what other people > have run into ... if anything. > > Thanks! We're running OFED 1.4 for our GPFS cluster, with RDMA used for data and IPoIB used for metadata and backups. We're looking at an upgrade to 1.5, so if you do find anything out I'd be very interested in knowing. -- -- Skylar Thompson (skylar at cs.earlham.edu) -- http://www.cs.earlham.edu/~skylar/ From cbergstrom at pathscale.com Sat Jan 30 14:52:43 2010 From: cbergstrom at pathscale.com ("C. Bergström") Date: Sun, 31 Jan 2010 01:52:43 +0300 Subject: [Beowulf] GPU Beowulf Clusters In-Reply-To: <4B647949.9030700@berkeley.edu> References: <4B61CB86.9060105@berkeley.edu> <20100130143145.43de588a@vivalunalitshi.luna.local> <4B647949.9030700@berkeley.edu> Message-ID: <4B64B83B.9080303@pathscale.com> Jon Forrest wrote: > On 1/30/2010 4:31 AM, Micha Feigin wrote: > >> It is recommended BTW, that you have at least the same amount of >> system memory >> as GPU memory, so with tesla it is 4GB per GPU. > > I'm not going to get Teslas, for several reasons: > > 1) This is a proof of concept cluster. Spending $1200 > per graphics card means that the GPUs alone, assuming > 2 GPUs, would cost as much as a whole node with > 2 consumer-grade cards. (See below) > > 2) We know that the Fermi cards are coming out > soon. If we were going to spend big bucks > on GPUs, we'd wait for them. But, our funding > runs out before the Fermis will be available. > This is too bad but there's nothing I can do > about it. > > See below for comments regarding CPUs and cores. > >> You use dedicated systems. Either one 1u pizza box for the CPU and a matched 1u >> tesla s1070 pizza box which has 4 tesla GPUs > > Since my first post I've learned about the Supermicro boxes > that have space for two GPUs > (http://www.supermicro.com/products/system/1U/6016/SYS-6016GT-TF.cfm?GPU=) > . > This looks like a good way to go for a proof-of-concept cluster. Plus, > since we have to pay $10/U/month at the Data Center, it's a good > way to use space. > > The GPU that looks the most promising is the GeForce GTX275. > (http://www.evga.com/products/moreInfo.asp?pn=017-P3-1175-AR) > It has 1792MB of RAM and is only ~$300. I realize that there > are better cards but for this proof-of-concept cluster we > want to get the best bang for the buck.
Later, after we've > ported our programs, and have some experience optimizing them, > then we'll consider something better, probably using whatever > the best Fermi-based card is. > > The research group that will be purchasing this cluster does > molecular dynamics simulations that usually take 24 hours or more > to complete using quad-core Xeons. We hope to bring down this > time substantially. > >> It doesn't have a swap in/swap out mechanism, so the way it may time >> share is >> by alternating kernels as long as there is enough memory. Shouldn't >> be done for >> HPC (same with CPU by the way due to numa/l2 cache and context switching >> issues). > > Right. So this means 4 cores should be good enough for 2 GPUs. > I wish somebody made a motherboard that would allow 6-core > AMD Istanbuls, but they don't. Putting 2 4-core CPUs on the > motherboard might not be worth the cost. I'm not sure. > >> The processes will be sharing the pci bus though for communications >> so you may >> prefer to set up the system as 1 job per machine or at least a round >> robin >> scheduler. > > This is another reason not to go crazy with lots of cores. > They'll be sitting idle most of the time, unless I also > create queues for normal non-GPU jobs. > >> Take note that the s1070 is ~6k$ so you are talking at most two to three >> machines here with your budget. > > Ha, ha!! ~$6K should get me two compute nodes, complete > with graphics cards. > > I appreciate everyone's comments, and I welcome more. Hi Jon, I must emphasize what David Mathog said about the importance of the gpu programming model. My perspective (with hopefully not too much opinion added): OpenCL vs CUDA - OpenCL is 1/10th as popular, lacking in features, more tedious to write, and in an effort to stay generic loses the potential to fully exploit the gpu. At one point the performance of the drivers from Nvidia was not equivalent, but I think that's been fixed. (This does not mean all vendors are uniformly doing a good job) As for HMPP and everything else, I'm far too biased to offer my comments publicly. (Feel free to email me offlist if curious) Have you considered sharing access with another research lab that has already purchased something similar? (Some vendors may also be willing to let you run your codes in exchange for feedback.) I'd not completely disregard the importance of the host processor. 1) sw thread synchronization chews up processor time 2) Do you already know if your code has enough computational complexity to outweigh the memory access costs? 3) Do you know if the GTX275 has enough vram? Your benchmarks will suffer if you start going to gart and page faulting 4) I can tell you 100% that not all gpus are created equal when it comes to handling cuda code. I don't have experience with the GTX275, but if you do hit issues I would be curious to hear about them. Some questions in return.. Is your code currently C, C++ or Fortran? Is there any interest in optimizations at the compiler level which could benefit molecular dynamics simulations?
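[Editor's note: to make point 3) above concrete, here is a minimal device-query sketch using the standard CUDA runtime API. The device index and output format are illustrative assumptions, but totalGlobalMem is what bounds how much you can cudaMalloc before allocations start failing.]

/* query_vram.cu -- build with: nvcc query_vram.cu -o query_vram */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;  /* filled in by the runtime */
    int dev = 0;          /* first GPU; enumerate with cudaGetDeviceCount() */

    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) {
        fprintf(stderr, "no usable CUDA device found\n");
        return 1;
    }
    printf("%s: %lu MB global memory, compute capability %d.%d\n",
           prop.name,
           (unsigned long)(prop.totalGlobalMem >> 20),
           prop.major, prop.minor);
    return 0;
}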
Best, ./Christopher From cbergstrom at pathscale.com Sun Jan 31 10:15:12 2010 From: cbergstrom at pathscale.com ("C. Bergström") Date: Sun, 31 Jan 2010 21:15:12 +0300 Subject: [Beowulf] GPU Beowulf Clusters In-Reply-To: <20100131193358.395acc13@vivalunalitshi.luna.local> References: <4B61CB86.9060105@berkeley.edu> <20100130143145.43de588a@vivalunalitshi.luna.local> <4B647949.9030700@berkeley.edu> <4B64B83B.9080303@pathscale.com> <4B64DD37.50201@berkeley.edu> <20100131193358.395acc13@vivalunalitshi.luna.local> Message-ID: <4B65C8B0.9060300@pathscale.com> Micha Feigin wrote: > On Sat, 30 Jan 2010 17:30:31 -0800 > Jon Forrest wrote: > > >> On 1/30/2010 2:52 PM, "C. Bergström" wrote: >> >> >>> Hi Jon, >>> >>> I must emphasize what David Mathog said about the importance of the gpu >>> programming model. >>> >> I don't doubt this at all. Fortunately, we have lots >> of very smart people here at UC Berkeley. I have >> the utmost confidence that they will figure this >> stuff out. My job is to purchase and configure the >> cluster. >> >> >>> My perspective (with hopefully not too much opinion added) >>> OpenCL vs CUDA - OpenCL is 1/10th as popular, lacking in features, more >>> tedious to write, and in an effort to stay generic loses the potential to >>> fully exploit the gpu. At one point the performance of the drivers from >>> Nvidia was not equivalent, but I think that's been fixed. (This does not >>> mean all vendors are uniformly doing a good job) >>> >> This is very interesting news. As far as I know, nobody is doing >> anything with OpenCL in the College of Chemistry around here. >> On the other hand, we've been following all the press about how >> it's going to be the great unifier so that it won't be necessary >> to use a proprietary API such as CUDA anymore. At this point it's too >> early to do anything with OpenCL until our colleagues in >> the Computer Science department have made a pass at it and >> have experiences to talk about. >> >> > > People are starting to work with OpenCL but I don't think that it's ready yet. > The nvidia implementation is still buggy and not up to par against cuda in > terms of performance. Code is longer and more tedious (mostly matches the > nvidia driver model instead of the much easier to use c api). I know that > although NVidia say that they fully support it, they don't like it too much. > NVidia techs told me that the performance difference can be about 1:2. > That used to be true, but I thought they fixed that? (How old is your information) > Cuda exists for 5 years (and another 2 internally in NVidia). Version 1 of > OpenCL was released December 2008 and they started working on 1.1 immediately > after that. It has also been broken almost from the start due to too many > companies controlling it (it's designed by a consortium) and trying to solve the > problem for too many scenarios at the same time. > The problem isn't too many companies.. It was IBM's cell requirements afaik.. Thank god that's dead now.. > ATI also started supporting OpenCL but I don't have any experience with that. > Their upside is that it also allows compiling cpu versions. > > I would start with cuda as the move to OpenCL is very simple afterwards if you > wish and Cuda is easier to start with. > I would start with a directive based approach that's entirely more sane than CUDA or OpenCL.. Especially if his code is primarily Fortran.
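[Editor's note: for concreteness, this is roughly what the "start with cuda" path looks like when the main program stays in Fortran. This is a hypothetical sketch, not code from the thread -- the kernel body and names are invented, and the trailing-underscore, pass-by-reference binding is only the default convention of most Fortran compilers of the era.]

/* scale_forces.cu -- hypothetical glue between a Fortran main program
   and a CUDA kernel. Fortran side: call scale_forces_gpu(n, a, f) */
#include <cuda_runtime.h>

/* stand-in for a real MD force kernel */
__global__ void scale_forces(int n, float a, float *f)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        f[i] *= a;
}

/* extern "C" suppresses C++ name mangling; the trailing underscore
   matches the usual Fortran external-name convention, and every
   argument arrives by reference */
extern "C" void scale_forces_gpu_(int *n, float *a, float *f)
{
    float *d_f;
    size_t bytes = (size_t)(*n) * sizeof(float);

    cudaMalloc((void **)&d_f, bytes);
    cudaMemcpy(d_f, f, bytes, cudaMemcpyHostToDevice);

    int block = 256;                       /* threads per block */
    int grid = (*n + block - 1) / block;   /* enough blocks to cover n */
    scale_forces<<<grid, block>>>(*n, *a, d_f);

    cudaMemcpy(f, d_f, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_f);
}

[Every offloaded routine needs its own wrapper and host/device copies like these, which is the maintenance cost being weighed just below.]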
I think writing C interfaces so that you can call the GPU is a maintenance nightmare and will not only be time consuming, but will later make optimizing the application *a lot* harder. (I say this with my gpu compiler hat on and am more than happy to go into specifics) > Also note that OpenCL gives you functional portability but not performance > portability. You will not write the same OpenCL code for NVidia, ATI, CPUs etc. > The vectorization should be all different (NVidia discourage vectorization, ATI > require vectorization, SSE requires different vectorization), the memory model > is different, the size of the work groups should be different, etc. > Please look at HMPP and see if it may solve this.. > >>> Have you considered sharing access with another research lab that has >>> already purchased something similar? >>> (Some vendors may also be willing to let you run your codes in exchange >>> for feedback.) >>> >> There's nobody else at UC Berkeley I know of who has a GPU >> cluster. >> >> I don't know of any vendor who'd be willing to volunteer >> their cluster. If anybody would like to volunteer, step >> right up. >> >> > > Are you aware of the NVidia professor partnership program? We got a Tesla S1070 > for free from them. > > http://www.nvidia.com/page/professor_partnership.html > > >>> 1) sw thread synchronization chews up processor time >>> >> Right, but let's say right now 80% of the CPU time is spent >> in routines that will eventually be done on the GPU (I'm >> just making this number up). I don't see how having a faster >> CPU would help overall. >> >> > > My experience is that unless you wish to write hybrid code (code that partly > runs on the GPU and partly on the CPU in parallel to fully utilize the system) > you don't need to care too much about the CPU power. > > Note that the Cuda model is asynchronous so you can run code in parallel > between the GPU and CPU. > > >>> 2) Do you already know if your code has enough computational complexity >>> to outweigh the memory access costs? >>> >> In general, yes. A couple of grad students have ported some >> of their code to CUDA with excellent results. Plus, molecular >> dynamics is well suited to GPU programming, or so I'm told. >> Several of the popular opensource MD packages have already >> been ported also with excellent results. >> >> > > The issue is not only computation complexity but also regular memory accesses. > Random memory accesses on the GPU can seriously kill your performance. > I think I mentioned memory accesses.. Are you talking about page faults or what specifically? (My perspective is skewed and I may be using a different term.) > Also note that until fermi comes out the double precision performance is > horrible. If you can't use single precision then GPUs are probably not for you > at the moment. Double precision on g200 is around 1/8 of single precision > and g80/g90 don't have double precision at all. > > Fermi improves that by finally providing double precision running at 1/2 the > single precision speed (basically combining two FPUs into one double precision > unit). > > >>> 3) Do you know if the GTX275 has enough vram? Your benchmarks will >>> suffer if you start going to gart and page faulting >>> > > You don't have page faulting on the GPU, GPUs don't have virtual memory. If you > don't have enough memory the allocation will just fail. > Whatever you want to label it at a hardware level nvidia cards *do* have vram and the drivers *can* swap to system memory.
They use two things to deal with this: a) a hw based page fault mechanism and b) dma copying to reduce cpu overhead. If you try to allocate more than is available on the card, yes, it will probably just fail. (We are working on the drivers) My point was about what happens between the context switches of kernels. > >> The one I mentioned in my posting has 1.8GB of RAM. If this isn't >> enough then we're in trouble. The grad student I mentioned >> has been using the 898MB version of this card without problems. >> >> >>> 4) I can tell you 100% that not all gpus are created equal when it >>> comes to handling cuda code. I don't have experience with the GTX275, >>> but if you do hit issues I would be curious to hear about them. >>> >> I've heard that it's much better than the 9500GT that we first >> started using. Since the 9500GT is a much cheaper card we didn't expect >> much performance out of it, but the grad student who was trying >> to use it said that there were problems with it not releasing memory, >> resulting in having to reboot the host. I don't know the details. >> >> > > I don't have any issues with releasing memory. The big differences are between > the g80/g90 series (including the 9500GT), which is a 1.1 Cuda model, and the > g200, which uses the 1.3 cuda model. > > Memory handling is much better on the 1.3 GPUs (the access patterns required to fully > utilize the memory bandwidth are much more lenient). The g200 also has double > precision support (although at about 1/8 the speed of single precision). There > is also more support for atomic operations and a few other differences, > although the biggest difference is the memory bandwidth utilization. > > Don't bother with the 8000 and 9000 for HPC and Cuda. Cheaper for learning but > not so much for deployment. > > >>> Some questions in return.. >>> Is your code currently C, C++ or Fortran? >>> >> The most important program for this group is in Fortran. >> We're going to keep it in Fortran, but we're going to >> write C interfaces to the routines that will run on >> the GPU, and then write these routines in C. >> >> > > You may want to look into the pgi compiler. They introduced Cuda support for > Fortran, I believe since November. > http://www.pgroup.com/resources/cudafortran.htm > Can anyone give positive feedback? (Disclaimer: I'm biased, but since we are making specific recommendations) http://www.caps-entreprise.com/fr/page/index.php?id=49&p_p=36 > >>> Is there any interest in optimizations at the compiler level which could >>> benefit molecular dynamics simulations? >>> >> Of course, but at what price? I'm talking about >> both the price in dollars, and the price in non-standard >> directives. >> >> I'm not a chemist so I don't know what would speed up MD calculations >> more than a good GPU. >> >> > > On the cpu side you can utilize SSE. You can also use single precision on the > CPU along with SSE and good cache utilization to greatly speed up things also > on the CPU. > > My personal experience though is that it's much harder to use such optimizations > on the CPU than on the GPU for most problems. > CUDA/OpenCL and friends implicitly identify which areas can be vectorized and then explicitly offload them. You are comparing apples/oranges here..
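[Editor's note: a sketch of the memory-access point argued above. These are hypothetical kernels, not code from the thread: consecutive threads reading consecutive addresses coalesce into a few wide memory transactions, while indirected reads (e.g. through an MD neighbour list) do not. The penalty is severe on compute 1.0/1.1 parts and milder on 1.3 (g200) hardware, as Micha describes.]

/* coalesced: thread i reads element i, so a half-warp's loads
   merge into a small number of wide memory transactions */
__global__ void copy_coalesced(int n, const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

/* scattered: the indirection through idx[] (think: a neighbour list)
   defeats coalescing, so each load may become its own transaction */
__global__ void gather_scattered(int n, const int *idx,
                                 const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[idx[i]];
}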
From plegresl at gmail.com Sun Jan 31 11:45:49 2010 From: plegresl at gmail.com (Patrick LeGresley) Date: Sun, 31 Jan 2010 11:45:49 -0800 Subject: [Beowulf] Re: GPU Beowulf Clusters In-Reply-To: <201001311920.o0VJKUTg027827@bluewest.scyld.com> References: <201001311920.o0VJKUTg027827@bluewest.scyld.com> Message-ID: <7A9A3281-CCBD-413A-A0CC-102A0BD6388B@gmail.com> I've found this presentation from John Stone at SC09 to be a very good comparison of CUDA versus OpenCL performance on real code: > http://www.ks.uiuc.edu/Research/gpu/files/openclbof_stone2009.pdf My takeaway from this presentation, which matches my personal experience comparing the two, is that CUDA and OpenCL performance on NVIDIA hardware are within a few percent. Trying to use the same source code on hardware from different vendors obviously has the expected performance pitfalls. The biggest thing to watch out for may be performance regressions from one release of CUDA to the next, and even among slightly different driver versions. You can see an example of this from John on slide 17. Cheers, Patrick From chenyon1 at iit.edu Thu Jan 28 08:57:39 2010 From: chenyon1 at iit.edu (Yong Chen) Date: Thu, 28 Jan 2010 10:57:39 -0600 Subject: [Beowulf] [hpc-announce] Call For Papers: Intl. Workshop on Parallel Programming Models and Systems Software for HEC (P2S2) Message-ID: CALL FOR PAPERS =============== Third International Workshop on Parallel Programming Models and Systems Software for High-end Computing (P2S2) Sept. 13th, 2010 To be held in conjunction with ICPP-2010: The 39th International Conference on Parallel Processing, Sept. 13-16, 2010, San Diego, CA, USA Website: http://www.mcs.anl.gov/events/workshops/p2s2 SCOPE ----- The goal of this workshop is to bring together researchers and practitioners in parallel programming models and systems software for high-end computing systems. Please join us in a discussion of new ideas, experiences, and the latest trends in these areas at the workshop. TOPICS OF INTEREST ------------------ The focus areas for this workshop include, but are not limited to: * Systems software for high-end scientific and enterprise computing architectures o Communication subsystems for high-end computing o High-performance file and storage systems o Fault-tolerance techniques and implementations o Efficient and high-performance virtualization and other management mechanisms for high-end computing * Programming models and their high-performance implementations o MPI, Sockets, OpenMP, Global Arrays, X10, UPC, Chapel, Fortress and others o Hybrid Programming Models * Tools for Management, Maintenance, Coordination and Synchronization o Software for Enterprise Data-centers using Modern Architectures o Job scheduling libraries o Management libraries for large-scale systems o Toolkits for process and task coordination on modern platforms * Performance evaluation, analysis and modeling of emerging computing platforms PROCEEDINGS ----------- Proceedings of this workshop will be published in CD format and will be available at the conference (together with the ICPP conference proceedings). SUBMISSION INSTRUCTIONS ----------------------- Submissions should be in PDF format on U.S. Letter size paper. They should not exceed 8 pages (all inclusive). Submissions will be judged based on relevance, significance, originality, correctness and clarity. Please visit the workshop website at: http://www.mcs.anl.gov/events/workshops/p2s2/ for the submission link.
JOURNAL SPECIAL ISSUE --------------------- The best papers of P2S2'10 will be included in a special issue of the International Journal of High Performance Computing Applications (IJHPCA) on Programming Models, Software and Tools for High-End Computing. IMPORTANT DATES --------------- Paper Submission: March 3rd, 2010 Author Notification: May 3rd, 2010 Camera Ready: June 14th, 2010 PROGRAM CHAIRS -------------- * Pavan Balaji, Argonne National Laboratory * Abhinav Vishnu, Pacific Northwest National Laboratory PUBLICITY CHAIR --------------- * Yong Chen, Illinois Institute of Technology STEERING COMMITTEE ------------------ * William D. Gropp, University of Illinois Urbana-Champaign * Dhabaleswar K. Panda, Ohio State University * Vijay Saraswat, IBM Research PROGRAM COMMITTEE ----------------- * Ahmad Afsahi, Queen's University * George Almasi, IBM Research * Taisuke Boku, Tsukuba University * Ron Brightwell, Sandia National Laboratory * Franck Cappello, INRIA, France * Yong Chen, Illinois Institute of Technology * Ada Gavrilovska, Georgia Tech * Torsten Hoefler, Indiana University * Zhiyi Huang, University of Otago, New Zealand * Hyun-Wook Jin, Konkuk University, Korea * Zhiling Lan, Illinois Institute of Technology * Doug Lea, State University of New York at Oswego * Jiuxing Liu, IBM Research * Guillaume Mercier, INRIA, France * Scott Pakin, Los Alamos National Laboratory * Fabrizio Petrini, IBM Research * Bronis de Supinski, Lawrence Livermore National Laboratory * Sayantan Sur, IBM Research * Rajeev Thakur, Argonne National Laboratory * Vinod Tipparaju, Oak Ridge National Laboratory * Jesper Traff, NEC, Europe * Weikuan Yu, Auburn University If you have any questions, please contact us at p2s2-chairs at mcs.anl.gov ======================================================================== If you do not want to receive any more announcements regarding the P2S2 workshop, please unsubscribe here: https://lists.mcs.anl.gov/mailman/listinfo/hpc-announce ========================================================================