From ebiederm at xmission.com Sat Jan 2 11:43:34 2010
From: ebiederm at xmission.com (Eric W. Biederman)
Date: Sat, 02 Jan 2010 11:43:34 -0800
Subject: [Beowulf] Performance tuning for Jumbo Frames
In-Reply-To: <20091216080118.GB8679@bx9.net> (Greg Lindahl's message of "Wed, 16 Dec 2009 00:01:18 -0800")
References: <4B286517.10009@myri.com> <20091216080118.GB8679@bx9.net>
Message-ID: 

Greg Lindahl writes:

> On Tue, Dec 15, 2009 at 11:41:59PM -0500, Patrick Geoffray wrote:
>
>> So, instead of requiring ~4K per port minimum, you need about ~20K per
>> port. Add to that up to 8 priorities with DCB and the buffering
>> requirements are quickly getting out of hand.
>
> Don't worry, switch vendors will simply implement it all poorly, just
> like InfiniBand. That's what always happens with overly-complicated
> QOS schemes.

The way I have heard per-priority flow control is expected to be used, one priority carries pause frames (for FCoE) while the other priorities drop frames; dropping frames when congested avoids head-of-line blocking, and thus gives better network performance overall.

Of course, in practice at 10G-type speeds you want enough switch buffers that you don't normally drop packets, but that is a different story.

Eric

From sdi at cs.hku.hk Sun Jan 3 08:13:56 2010
From: sdi at cs.hku.hk (Di sheng)
Date: Mon, 4 Jan 2010 00:13:56 +0800
Subject: [Beowulf] [hpc-announce] CCGrid 2010: CALL FOR POSTERS
Message-ID: <47e012df1001030813u1331402eg7233364171ef724a@mail.gmail.com>

---------------------------------------------------------------------------
We apologize if you receive multiple copies of this CFP
---------------------------------------------------------------------------

***********************************************************************
* CCGrid 2010: CALL FOR POSTERS
* Poster submission deadline extended to 18 January 2010
*
* The 10th IEEE/ACM International Symposium on
* Cluster, Cloud and Grid Computing (CCGrid 2010)
* May 17-20, 2010, Melbourne, Victoria, Australia
* URL: http://www.manjrasoft.com/ccgrid2010/
**********************************************************************

We invite participants to submit a poster to the CCGrid 2010 conference. CCGrid aims at presenting the latest breakthroughs in Cluster, Grid and Cloud technologies for both academic and industry professionals. The submission areas of interest match those of the conference, and include the following:

* Cluster technologies
* Grid Architectures and Systems
* Utility Computing Models for Clusters and Grids
* Grid Economies and Service Architectures
* Service Composition and Orchestration
* Middleware for Clusters and Grids
* Parallel and Wide-Area File Systems
* Peer-to-Peer Systems
* Cloud Computing
* Community and collaborative computing networks
* Grid Trust and Security
* Support for Autonomic Grid Infrastructure
* Resource Management
* Scheduling and Load Balancing
* Programming Models, Tools, and Environments
* Performance Evaluation and Modeling
* Grid-based Problem Solving Environments
* Scientific, Engineering, and Commercial Applications

Posters can be submitted in one of two ways:

1. Proceedings published posters: Participants submitting proceedings published posters are required to submit a 2-page short paper describing the poster content, research, relevance and importance to the cluster, grid and cloud computing community. If accepted, these 2-page short papers will be published in the proceedings of the conference.
2. Web published posters: These posters require a short 1-page abstract to be submitted. These abstracts will not be included in the conference proceedings, but will be published on the conference website.

For both forms of posters, participants will be able to display the poster during the conference and give a short presentation about it.

Important dates and guidelines for poster submission can be found below:

Deadline for proceedings published posters : Jan 18th
Notification of Acceptance : Feb 5th
Final Version Due : Feb 12th
Deadline for web published posters : Apr 1st

Two awards, (a) best poster and (b) best poster presentation, sponsored by ManjraSoft, will be presented.

Posters Committee:
=================
Ahmad Afsahi, Queen's University
David Bernholdt, Oak Ridge National Laboratory
Darius Buntinas, Argonne National Laboratory
Yong Chen, Illinois Institute of Technology
Mark Gardner, Virginia Tech
Torsten Hoefler, Indiana University
Hyun-wook Jin, Konkuk University
Nicholas Karonis, Northern Illinois University
Zhiling Lan, Illinois Institute of Technology
Jiuxing Liu, IBM Research
Scott Pakin, Los Alamos National Laboratory
Sayantan Sur, IBM Research
Abhinav Vishnu, Pacific Northwest National Laboratory
Venkatram Vishwanath, Argonne National Laboratory

From mdidomenico4 at gmail.com Mon Jan 4 17:54:48 2010
From: mdidomenico4 at gmail.com (Michael Di Domenico)
Date: Mon, 4 Jan 2010 20:54:48 -0500
Subject: [Beowulf] cisco networking
Message-ID: 

It's my understanding that you can only have one LACP/EtherChannel link between two Cisco switches, and that link can only comprise up to eight actual links, which using 10Gbps links would yield around ~8GB/sec.

Does anyone know if it's possible to get more? Between two Cisco 6500's?

I'm pretty sure the answer is no, but I figured I'd check...

From doseyg at r-networks.net Sat Jan 9 22:29:12 2010
From: doseyg at r-networks.net (Glen Dosey)
Date: Sun, 10 Jan 2010 01:29:12 -0500
Subject: [Beowulf] cisco networking
In-Reply-To: 
References: 
Message-ID: <1263104952.5838.37.camel@eclipse.office.r-networks.net>

You can have multiple ether-channel links between 2 switches. The limitation is that if they are in the same layer 2 broadcast domain (VLAN), only 1 will be active and STP will block the other.

If you place each switch in a separate network, you could use equal-cost load balancing across multiple point-to-point SVIs on ether-channel links. You'll increase bandwidth at the cost of a little additional latency. Of course, at the bandwidths you are talking about (160 Gbit/s or more) I have no idea how it would really work. Backplane bandwidth, hashing algorithms, and all sorts of other factors come into play and could cause it to fail completely. I'd love to try it out, but I can't help but think there is probably a better architectural solution.

What are you trying to do ?

On Mon, 2010-01-04 at 20:54 -0500, Michael Di Domenico wrote:
> It's my understanding that you can only have one LACP/EtherChannel link
> between two Cisco switches, and that link can only comprise up to eight
> actual links, which using 10Gbps links would yield around ~8GB/sec.
>
> Does anyone know if it's possible to get more? Between two Cisco 6500's?
>
> I'm pretty sure the answer is no, but I figured I'd check...
From mdidomenico4 at gmail.com Mon Jan 11 06:58:24 2010
From: mdidomenico4 at gmail.com (Michael Di Domenico)
Date: Mon, 11 Jan 2010 09:58:24 -0500
Subject: [Beowulf] cisco networking
In-Reply-To: <1263104952.5838.37.camel@eclipse.office.r-networks.net>
References: <1263104952.5838.37.camel@eclipse.office.r-networks.net>
Message-ID: 

On Sun, Jan 10, 2010 at 1:29 AM, Glen Dosey wrote:
> You can have multiple ether-channel links between 2 switches. The
> limitation is that if they are in the same layer 2 broadcast domain
> (VLAN), only 1 will be active and STP will block the other.
>
> If you place each switch in a separate network, you could use equal-cost
> load balancing across multiple point-to-point SVIs on ether-channel
> links. You'll increase bandwidth at the cost of a little additional
> latency. Of course, at the bandwidths you are talking about (160 Gbit/s
> or more) I have no idea how it would really work. Backplane bandwidth,
> hashing algorithms, and all sorts of other factors come into play and
> could cause it to fail completely. I'd love to try it out, but I can't
> help but think there is probably a better architectural solution.
>
> What are you trying to do ?

The attempt is really to get a bunch of devices which only have 1Gbps Ethernet connectivity to talk to a bunch of other machines with InfiniBand connectivity (i.e., compute nodes to storage). But we want the aggregate bandwidth to be over 5GB/s.

I was trying to see if there was a simple solution using some of the existing stuff we have, without having to get overly creative with the network or purchase 10G Ethernet to IB routers...

Looks like IB routers are the only solution that really works for me...

From Greg at keller.net Mon Jan 11 12:53:47 2010
From: Greg at keller.net (Greg Keller)
Date: Mon, 11 Jan 2010 14:53:47 -0600
Subject: [Beowulf] cisco networking
In-Reply-To: <201001112000.o0BK07NA020635@bluewest.scyld.com>
References: <201001112000.o0BK07NA020635@bluewest.scyld.com>
Message-ID: <8692F655-98DA-4E6D-849D-6FA7D5AC822D@Keller.net>

> From: Michael Di Domenico
> Subject: Re: [Beowulf] cisco networking
>
> On Sun, Jan 10, 2010 at 1:29 AM, Glen Dosey wrote:
>> You can have multiple ether-channel links between 2 switches. The
>> limitation is that if they are in the same layer 2 broadcast domain
>> (VLAN), only 1 will be active and STP will block the other.
>>
>> [...]
>>
>> What are you trying to do ?
> The attempt is really to get a bunch of devices which only have
> 1Gbps Ethernet connectivity to talk to a bunch of other machines with
> InfiniBand connectivity (i.e., compute nodes to storage). But we want
> the aggregate bandwidth to be over 5GB/s.
>
> I was trying to see if there was a simple solution using some of the
> existing stuff we have, without having to get overly creative with the
> network or purchase 10G Ethernet to IB routers...
>
> Looks like IB routers are the only solution that really works for me...

Remember that you can actually use nodes to route IP between GbE and IB, just don't expect them to run at IB wirespeed. This can at least save you from paying a premium for proprietary hardware in order to test functionality. Last I tried, 3-4 Gbit was about as fast as IP over IB would get through the interface, presumably due to IP overhead... but that was long ago, so if you try it I'd love to hear your results.

From jmdavis1 at vcu.edu Mon Jan 11 14:19:56 2010
From: jmdavis1 at vcu.edu (Mike Davis)
Date: Mon, 11 Jan 2010 17:19:56 -0500
Subject: [Beowulf] cisco networking
In-Reply-To: <8692F655-98DA-4E6D-849D-6FA7D5AC822D@Keller.net>
References: <201001112000.o0BK07NA020635@bluewest.scyld.com> <8692F655-98DA-4E6D-849D-6FA7D5AC822D@Keller.net>
Message-ID: <4B4BA40C.9090607@vcu.edu>

Greg Keller wrote:
>> The attempt is really to get a bunch of devices which only have
>> 1Gbps Ethernet connectivity to talk to a bunch of other machines with
>> InfiniBand connectivity (i.e., compute nodes to storage). But we want
>> the aggregate bandwidth to be over 5GB/s.
>>
>> Looks like IB routers are the only solution that really works for me...
>
> Remember that you can actually use nodes to route IP between GbE and IB,
> just don't expect them to run at IB wirespeed.

Cisco switches can be connected via trunked links. This allows you to use multiple links to increase the bandwidth. The limit used to be 3, but that may have changed. You can check the switch's online documentation for more information.
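For anyone wanting to try Greg's suggestion of pressing a dual-homed node into service as a GbE-to-IPoIB router, a minimal sketch follows. The interface names, addresses, and subnets are illustrative assumptions, not details from the thread:

    # On the router node (eth0 on the GbE side, ib0 running IPoIB):
    sysctl -w net.ipv4.ip_forward=1         # enable IP forwarding

    ip addr add 192.168.1.1/24 dev eth0     # GbE-only compute nodes live here
    ip addr add 10.0.0.1/24 dev ib0         # IB-attached storage lives here

    # On each GbE-only client, reach the storage subnet via the router node:
    ip route add 10.0.0.0/24 via 192.168.1.1

Since one such node tops out well below IB wirespeed (Greg saw 3-4 Gbit of IP-over-IB throughput), several router nodes can be run in parallel, with different groups of clients pointed at different routers, to approach an aggregate like the 5GB/s mentioned above.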
From forum.san at gmail.com Mon Jan 11 23:19:47 2010
From: forum.san at gmail.com (Sangamesh B)
Date: Tue, 12 Jan 2010 12:49:47 +0530
Subject: [Beowulf] Need some advice: Sun storage management server hangs repeatedly
Message-ID: 

Hi HPC experts,

I seek your advice/suggestions to resolve a storage (NAS) server's repeated hanging problem.

We have a 23-node Rocks 5.1 HPC cluster. The 12 TB Sun storage is connected to a Sun Fire X4150 management server running RHEL 5.3, and this server is connected to the Gigabit switch that provides the cluster's private network. The home directories on the cluster are NFS-mounted from the storage partitions across all nodes, including the master.

This server hangs repeatedly. As initial troubleshooting we installed Ganglia to check network utilization, but it looks normal. We're not sure how to troubleshoot and resolve the problem. Can anybody help us resolve this issue?

Thanks,
Sangamesh

From john.hearns at mclaren.com Tue Jan 12 01:47:18 2010
From: john.hearns at mclaren.com (Hearns, John)
Date: Tue, 12 Jan 2010 09:47:18 -0000
Subject: [Beowulf] Need some advice: Sun storage management server hangs repeatedly
In-Reply-To: 
References: 
Message-ID: <68A57CCFD4005646957BD2D18E60667B0ED811A1@milexchmb1.mil.tagmclarengroup.com>

> This server hangs repeatedly. As initial troubleshooting we installed
> Ganglia to check network utilization, but it looks normal. We're not
> sure how to troubleshoot and resolve the problem. Can anybody help us
> resolve this issue?

Some tools you need:

ping -f on the clients (ie flood pings)
tcpdump on both clients and server
nfsstat on both client and server
iostat on the server

Can you be a little bit clearer on what you mean by a "hang"?
Do you see messages about the NFS server in /var/log/messages on the client?
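To make John's list concrete, here is one plausible way to run those tools, assuming the NFS server sits at 192.168.1.10 and eth0 carries the cluster network (the address and interface name are assumptions):

    # On a client: flood-ping the server and watch for packet loss
    ping -f 192.168.1.10

    # On both client and server: watch NFS traffic on the wire
    tcpdump -i eth0 host 192.168.1.10 and port 2049

    # NFS call statistics, including retransmissions
    nfsstat -c        # on a client
    nfsstat -s        # on the server

    # On the server: per-device utilization at 5-second intervals
    iostat -x 5

    # On a client: look for "nfs: server ... not responding" messages
    grep -i nfs /var/log/messages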
From rpnabar at gmail.com Tue Jan 12 23:06:25 2010
From: rpnabar at gmail.com (Rahul Nabar)
Date: Wed, 13 Jan 2010 01:06:25 -0600
Subject: [Beowulf] are compute nodes always kept in a private IP and switch space?
Message-ID: 

I always took it as natural to keep all compute nodes on a private switch and to assign them local IP addresses. This was almost axiomatic for an HPC application in my mind. This way I can channel all traffic to the world, and all logins, through a select login node, then firewall the login nodes carefully.

Just today, though, on a new project the admin said he always keeps his compute nodes on public IPs and runs individual firewalls on them.

This seemed just so wrong to me in so many ways, but I was curious if there are legitimate reasons why people might do this? Just curious.

--
Rahul

From beat at 0x1b.ch Tue Jan 12 23:37:50 2010
From: beat at 0x1b.ch (Beat Rubischon)
Date: Wed, 13 Jan 2010 08:37:50 +0100
Subject: [Beowulf] are compute nodes always kept in a private IP and switch space?
In-Reply-To: 
Message-ID: 

Hello!

Quoting (13.01.10 08:06):

> This seemed just so wrong to me in so many ways, but I was curious if
> there are legitimate reasons why people might do this? Just curious.

I see both approaches, even though the private LAN is the more common solution.

There are applications that need interaction with a graphical frontend on the user's workstation. Other reasons are braindead license servers that are not NATable, like the ones used by Catia or LS-DYNA. And management can be much easier when the administrator is able to contact every device directly from his workstation.

Of course, none of those examples needs public IPs. A range of campus- or company-wide routed private IPs is good enough. Remember, 2010 is the last year in which IANA is able to provide IP space :-)

The private LAN has the big advantage of being a "protected zone", usually located in a locked datacenter. Exporting NFS or any kind of cluster filesystem to the whole subnet is much, much easier than using dedicated exports or netgroups for each node. Several cluster-related tools do not filter requests and are vulnerable to spoofing attacks; I mainly think of Ganglia or syslogd, which accept any UDP packet sent to them. Opening the cluster LAN always means additional effort to keep the system secure.

So both approaches make sense. It depends on your needs and your existing environment, and also on your experience in system and network security.

Beat

--
 \|/ Beat Rubischon
( 0^0 ) http://www.0x1b.ch/~beat/
oOO--(_)--OOo---------------------------------------------------
My experiences, thoughts and dreams: http://www.0x1b.ch/blog/

From eugen at leitl.org Wed Jan 13 01:37:41 2010
From: eugen at leitl.org (Eugen Leitl)
Date: Wed, 13 Jan 2010 10:37:41 +0100
Subject: [Beowulf] Sun/AMD HPC for Dummies ebook now as PDF
Message-ID: <20100113093741.GL17686@leitl.org>

http://www.sun.com/x64/ebooks/hpc.jsp

HPC for Dummies

HPC enables us to first model then manipulate products, services, and techniques. These days, HPC has moved from a selective and expensive endeavor to a cost-effective enabling technology within reach of virtually every budget. This book will help you to get a handle on exactly what HPC does and can be.

This special edition eBook from Sun and AMD shares details on real-world uses of HPC, explains the different types of HPC, guides you on how to choose between different suppliers, and provides benchmarks and guidelines you can use to get your system up and running.

https://dct.sun.com/dct/forms/reg_us_1808_222_0.jsp

--
Eugen* Leitl leitl http://leitl.org
______________________________________________________________
ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE

From tegner at renget.se Wed Jan 13 03:40:41 2010
From: tegner at renget.se (tegner at renget.se)
Date: Wed, 13 Jan 2010 12:40:41 +0100
Subject: [Beowulf] Parallel file systems
Message-ID: 

While starting to investigate different storage solutions I came across Gluster (www.gluster.com). I did a search on beowulf.org and came up with nothing; gpfs, pvfs and lustre, on the other hand, resulted in lots of hits. Anyone with experience of Gluster in HPC?

Regards,

/jon

From bob at drzyzgula.org Wed Jan 13 05:34:30 2010
From: bob at drzyzgula.org (Bob Drzyzgula)
Date: Wed, 13 Jan 2010 08:34:30 -0500
Subject: [Beowulf] Re: Sun/AMD HPC for Dummies ebook now as PDF
In-Reply-To: <20100113093741.GL17686@leitl.org>
References: <20100113093741.GL17686@leitl.org>
Message-ID: <20100113133430.GA27540@mx1.drzyzgula.org>

FYI, I downloaded this and it is 46 pages of cursory, high-level overview of the concept of HPC. It doesn't even have any Rich Tennant cartoons.
Just about anyone would get more out of spending an hour following threads out of the Wikipedia High-Performance Computing page. The exception might be a manager who doesn't know what the term "high-performance computing" means but at the same time somehow has a budget to go out and buy and staff a cluster -- that person might learn a little about where to start.

On the plus side, the registration form seems to accept just about any random crap, and didn't even make me check the mailinator email address I used to get the download link.

--Bob

On 13/01/10 10:37 +0100, Eugen Leitl wrote:
> http://www.sun.com/x64/ebooks/hpc.jsp
>
> HPC for Dummies
>
> HPC enables us to first model then manipulate products, services, and
> techniques. These days, HPC has moved from a selective and expensive
> endeavor to a cost-effective enabling technology within reach of
> virtually every budget. This book will help you to get a handle on
> exactly what HPC does and can be.
>
> https://dct.sun.com/dct/forms/reg_us_1808_222_0.jsp

From rpnabar at gmail.com Wed Jan 13 10:05:27 2010
From: rpnabar at gmail.com (Rahul Nabar)
Date: Wed, 13 Jan 2010 12:05:27 -0600
Subject: [Beowulf] are compute nodes always kept in a private IP and switch space?
In-Reply-To: 
References: 
Message-ID: 

On Wed, Jan 13, 2010 at 1:37 AM, Beat Rubischon wrote:
> on the user's workstation. Other reasons are braindead license servers
> that are not NATable, like the ones used by Catia or LS-DYNA. And management
> can be much easier when the administrator is able to contact every device
> directly from his workstation.

Thanks! Oh! I thought NAT worked transparently and the application didn't even realize it was NAT-ed. I didn't know some servers could have a problem with this.

--
Rahul

From hahn at mcmaster.ca Wed Jan 13 10:37:39 2010
From: hahn at mcmaster.ca (Mark Hahn)
Date: Wed, 13 Jan 2010 13:37:39 -0500 (EST)
Subject: [Beowulf] are compute nodes always kept in a private IP and switch space?
In-Reply-To: 
References: 
Message-ID: 

>> on the user's workstation. Other reasons are braindead license servers
>> that are not NATable, like the ones used by Catia or LS-DYNA. And management
>> can be much easier when the administrator is able to contact every device
>> directly from his workstation.

I don't agree with the latter at all; the marginal effort of admining through another box is trivial.

> Oh! I thought NAT worked transparently and the application didn't even
> realize it was NAT-ed. I didn't know some servers could have a problem
> with this.

A client using NAT will not know any different, but the _talked-to_ service might, since it'll see multiple connections from the same NAT server address(es), and won't be able to originate a socket to the client (unless the NATer has some protocol-specific awareness, like NATed FTP).
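For concreteness, the head-node masquerading setup being discussed usually boils down to a couple of lines; eth0 as the public interface and 192.168.0.0/24 as the private cluster range are assumptions here:

    # On the head node: forward and masquerade compute-node traffic
    sysctl -w net.ipv4.ip_forward=1
    iptables -t nat -A POSTROUTING -o eth0 -s 192.168.0.0/24 -j MASQUERADE

This is exactly the case where a license server that tries to open a connection back to a compute node fails: every node appears as the head node's address, and there is no mapping for the inbound socket unless a protocol-specific helper or an explicit port forward exists.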
We have all our compute nodes on private addresses and also disable NAT. This does make it somewhat trickier to get externally-hosted evil-type licenses to work (flexlm with vendor daemons), but I'd say this is a fairly useful dividing issue: clusters that are all-public tend to be personal-ish and small; clusters that are larger and support very wide groups tend to be more tightly controlled. I think the way to think of it is that if you have a personal or limited-purpose cluster, you _do_ in fact want it to depend on (and wait on) external resources (licenses, fileservers, GUI apps). For a large, broad-purpose cluster with lots of disparate users, it's very important to minimize those sources of complexity and inefficiency.

From joshua_mora at usa.net Wed Jan 13 09:08:31 2010
From: joshua_mora at usa.net (Joshua mora acosta)
Date: Wed, 13 Jan 2010 11:08:31 -0600
Subject: [Beowulf] Re: Sun/AMD HPC for Dummies ebook now as PDF
Message-ID: <788oamRHf7008S29.1263402511@cmsweb29.cms.usa.net>

Hi,

I think you misinterpreted the title. It is what it is: "HPC for Dummies". Enough to explain in a plain way to anyone what HPC is, and it may not be that easy to make a good summary of such a broad topic in 46 pages.

It would be great, though, to see a title like "HPC for the next decade" or "beyond HPC" that summarizes the ongoing investigations/challenges we are going to be facing. That summary of _balanced (politically correct)/broad (physics+hw+sw+business)_ ideas may need a similar level of effort to the effort done on "HPC for Dummies".

Regards,
Joshua

------ Original Message ------
Received: 07:45 AM CST, 01/13/2010
From: Bob Drzyzgula
To: Eugen Leitl
Cc: Beowulf at beowulf.org
Subject: [Beowulf] Re: Sun/AMD HPC for Dummies ebook now as PDF

> FYI, I downloaded this and it is 46 pages of cursory,
> high-level overview of the concept of HPC. It doesn't
> even have any Rich Tennant cartoons. Just about anyone
> would get more out of spending an hour following threads
> out of the Wikipedia High-Performance Computing page. The
> exception might be a manager who doesn't know what the
> term "high-performance computing" means but at the same
> time somehow has a budget to go out and buy and staff
> a cluster -- that person might learn a little about
> where to start.
>
> On the plus side, the registration form seems to accept
> just about any random crap, and didn't even make me check
> the mailinator email address I used to get the download
> link.
>
> --Bob
>
> On 13/01/10 10:37 +0100, Eugen Leitl wrote:
> >
> > http://www.sun.com/x64/ebooks/hpc.jsp
> >
> > HPC for Dummies
> >
> > HPC enables us to first model then manipulate products, services, and
> > techniques. These days, HPC has moved from a selective and expensive
> > endeavor to a cost-effective enabling technology within reach of
> > virtually every budget. This book will help you to get a handle on
> > exactly what HPC does and can be.
> >
> > https://dct.sun.com/dct/forms/reg_us_1808_222_0.jsp

From rpnabar at gmail.com Wed Jan 13 16:35:46 2010
From: rpnabar at gmail.com (Rahul Nabar)
Date: Wed, 13 Jan 2010 18:35:46 -0600
Subject: [Beowulf] are compute nodes always kept in a private IP and switch space?
In-Reply-To: 
References: 
Message-ID: 

On Wed, Jan 13, 2010 at 12:37 PM, Mark Hahn wrote:
> We have all our compute nodes on private addresses and also disable NAT.
> This does make it somewhat trickier to get externally-hosted evil-type
> licenses to work (flexlm with vendor daemons), but I'd say this is a fairly
> useful dividing issue: clusters that are all-public tend to be personal-ish
> and small; clusters that are larger and support very wide groups tend to be
> more tightly controlled. I think the way to think of it is that if you
> have a personal or limited-purpose cluster, you _do_ in fact want it to
> depend on (and wait on) external resources (licenses, fileservers, GUI apps).
> For a large, broad-purpose cluster with lots of disparate users, it's very
> important to minimize those sources of complexity and inefficiency.

Thanks Mark! Makes a lot of sense. Thanks for sharing your experiences!

--
Rahul

From skylar at cs.earlham.edu Wed Jan 13 20:58:34 2010
From: skylar at cs.earlham.edu (Skylar Thompson)
Date: Wed, 13 Jan 2010 20:58:34 -0800
Subject: [Beowulf] are compute nodes always kept in a private IP and switch space?
In-Reply-To: 
References: 
Message-ID: <4B4EA47A.2070500@cs.earlham.edu>

Rahul Nabar wrote:
> I always took it as natural to keep all compute nodes on a private
> switch and to assign them local IP addresses. This was almost
> axiomatic for an HPC application in my mind. This way I can channel
> all traffic to the world, and all logins, through a select login node,
> then firewall the login nodes carefully.
>
> Just today, though, on a new project the admin said he always keeps
> his compute nodes on public IPs and runs individual firewalls on them.
>
> This seemed just so wrong to me in so many ways, but I was curious if
> there are legitimate reasons why people might do this? Just curious.

I do everything I can to keep cluster nodes on a private network, with only the head node visible on the public network. One exception I've had to make is when storage is on a separate network. NAT doesn't do well with CIFS/NFS, so it's just easier giving the nodes fully-routable IP addresses.

--
-- Skylar Thompson (skylar at cs.earlham.edu)
-- http://www.cs.earlham.edu/~skylar/
From skylar at cs.earlham.edu Wed Jan 13 21:08:43 2010
From: skylar at cs.earlham.edu (Skylar Thompson)
Date: Wed, 13 Jan 2010 21:08:43 -0800
Subject: [Beowulf] Need some advice: Sun storage management server hangs repeatedly
In-Reply-To: 
References: 
Message-ID: <4B4EA6DB.8070207@cs.earlham.edu>

Sangamesh B wrote:
> Hi HPC experts,
>
> I seek your advice/suggestions to resolve a storage (NAS) server's
> repeated hanging problem.
>
> We have a 23-node Rocks 5.1 HPC cluster. The 12 TB Sun storage is
> connected to a Sun Fire X4150 management server running RHEL 5.3, and
> this server is connected to the Gigabit switch that provides the
> cluster's private network. The home directories on the cluster are
> NFS-mounted from the storage partitions across all nodes, including
> the master.
>
> This server hangs repeatedly. As initial troubleshooting we installed
> Ganglia to check network utilization, but it looks normal. We're not
> sure how to troubleshoot and resolve the problem. Can anybody help us
> resolve this issue?

Is there anything amiss according to the service processor?

--
-- Skylar Thompson (skylar at cs.earlham.edu)
-- http://www.cs.earlham.edu/~skylar/

From rpnabar at gmail.com Thu Jan 14 15:40:27 2010
From: rpnabar at gmail.com (Rahul Nabar)
Date: Thu, 14 Jan 2010 17:40:27 -0600
Subject: [Beowulf] running the Linpack-HPL benchmark
Message-ID: 

I've never had a cluster large enough to matter, but I was thinking of running the Linpack-HPL benchmark (from the top500 site) just out of curiosity, to know my actual teraflops. For one, it would tell me my Rmax/Rpeak ratio, so that I know how non-optimal my network and other infrastructure are.

Question: how difficult is it to get that benchmark to run? I was eager to hear opinions from sysadmins who've "been there, done that". If it is a horrendously difficult process I might just skip it; some of the tuning sections looked scary.

Are there any good pointers on tuning-parameter selection, or ready-made makefiles? I have Intel Nehalem processors and a regular Gigabit network.

--
Rahul

From gus at ldeo.columbia.edu Thu Jan 14 17:25:17 2010
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Thu, 14 Jan 2010 20:25:17 -0500
Subject: [Beowulf] running the Linpack-HPL benchmark
In-Reply-To: 
References: 
Message-ID: <4B4FC3FD.6040302@ldeo.columbia.edu>

Hi Rahul

It is a bit involved, but not very difficult, to set up HPL.

First get the Goto BLAS/LAPACK from TACC:
http://www.tacc.utexas.edu/tacc-projects/
Install it using the Gnu compilers.

Then get HPL from Netlib:
www.netlib.org/benchmark/hpl/
Tweak the Makefile to point to your MPI wrappers and to the Goto library. Build HPL.

Read the TUNING file that comes with HPL. It has important information about the input parameters. The main ones are N, P, and Q.
www.netlib.org/benchmark/hpl/tuning.html

First, to test, run HPL on a single node or a few nodes, using small values of N, say 1000 to 20000.

The maximum value of N can be approximated by Nmax = sqrt(0.8*Total_RAM_on_ALL_nodes_in_bytes/8). This uses all the RAM, but doesn't get into memory paging.
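Gus's formula is easy to evaluate in the shell; the node count and per-node RAM below are placeholders, not numbers from the thread. The 0.8 factor leaves 20% of memory for the OS and MPI buffers, and the division by 8 converts bytes to double-precision words:

    # Nmax = sqrt(0.8 * total_RAM_in_bytes / 8); round down to a multiple of NB
    NODES=16
    GB_PER_NODE=16
    echo "sqrt(0.8 * $NODES * $GB_PER_NODE * 1024^3 / 8)" | bc -l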
Then run HPL on the whole cluster with the Nmax above. Nmax pushes the envelope, and is where your best performance (Rmax/Rpeak) is likely to be reached. Try several P/Q combinations for Nmax (see the TUNING file).

I hope this helps,
Gus Correa

---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

Rahul Nabar wrote:
> I've never had a cluster large enough to matter, but I was thinking of
> running the Linpack-HPL benchmark (from the top500 site) just out of
> curiosity, to know my actual teraflops. For one, it would tell me
> my Rmax/Rpeak ratio, so that I know how non-optimal my network and
> other infrastructure are.
>
> Question: how difficult is it to get that benchmark to run? I was eager
> to hear opinions from sysadmins who've "been there, done that".
> If it is a horrendously difficult process I might just skip it; some
> of the tuning sections looked scary.
>
> Are there any good pointers on tuning-parameter selection, or ready-made
> makefiles? I have Intel Nehalem processors and a regular Gigabit
> network.

From walid.shaari at gmail.com Sat Jan 16 00:50:48 2010
From: walid.shaari at gmail.com (Walid)
Date: Sat, 16 Jan 2010 11:50:48 +0300
Subject: [Beowulf] HPC/mpi courses
Message-ID: 

Dear All,

Do you know of any official courses run in Europe or Asia covering HPC systems or development? MPI or new distributed-memory paradigms are welcome.

kind regards

Walid

From madskaddie at gmail.com Sat Jan 16 02:44:12 2010
From: madskaddie at gmail.com (madskaddie at gmail.com)
Date: Sat, 16 Jan 2010 10:44:12 +0000
Subject: [Beowulf] Gridengine and bash + Modules
Message-ID: 

Greetings,

I'm using gridengine (6.2u4, open source version) and I would like to use the Modules software. Modules uses a shell function that must be exported (bash: "export -f func_name") in order to set environment variables, but gridengine has a bug related to bash exported functions [1]. Is anybody using gridengine, bash and Modules? How can this be solved? Changing shells is not an option ;)

This issue is also being discussed here [2].

Thanks,
Gil

[1] - http://gridengine.sunsource.net/issues/show_bug.cgi?id=2173
[2] - http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&viewType=browseAll&dsMessageId=238562#messagefocus

--
" It can't continue forever. The nature of exponentials is that you push them out and eventually disaster happens. "
Gordon Moore (Intel co-founder and author of Moore's law)

From richard.walsh at comcast.net Sat Jan 16 05:06:55 2010
From: richard.walsh at comcast.net (richard.walsh at comcast.net)
Date: Sat, 16 Jan 2010 13:06:55 +0000 (UTC)
Subject: [Beowulf] HPC/mpi courses
In-Reply-To: 
Message-ID: <968564064.9709281263647215844.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net>

On Saturday, January 16, 2010 Khalid M. Issa wrote:

> Do you know of any official courses run in Europe or Asia covering
> HPC systems or development? MPI or new distributed-memory paradigms
> are welcome.

No, but I could send you a PDF of about 100 slides I put together for a side-by-side course I teach on CAF and UPC (for your use only). I also have an MPI intro course, but cannot send that (copyrighted). Finally, I recommend looking at the US National Lab sites (LLNL in particular), which have excellent OpenMP and MPI tutorials.
Regards,

Richard Walsh
Principal, Thrashing River Computing, and
Parallel Applications and Systems Manager, CUNY HPC Center

From jcownie at cantab.net Sat Jan 16 06:14:24 2010
From: jcownie at cantab.net (James Cownie)
Date: Sat, 16 Jan 2010 14:14:24 +0000
Subject: [Beowulf] HPC/mpi courses
In-Reply-To: 
References: 
Message-ID: <70EC0BEE-F6FC-422D-BE96-39A9FE74E9C1@cantab.net>

On 16 Jan 2010, at 08:50, Walid wrote:

> Dear All,
>
> Do you know of any official courses run in Europe or Asia covering
> HPC systems or development? MPI or new distributed-memory paradigms
> are welcome.

https://fs.hlrs.de/projects/par/events/2010/parallel_prog_2010/

There are no doubt many others.

--
-- Jim
--
James Cownie

From rpnabar at gmail.com Sat Jan 16 19:02:35 2010
From: rpnabar at gmail.com (Rahul Nabar)
Date: Sat, 16 Jan 2010 21:02:35 -0600
Subject: [Beowulf] running the Linpack-HPL benchmark
In-Reply-To: <4B4FC3FD.6040302@ldeo.columbia.edu>
References: <4B4FC3FD.6040302@ldeo.columbia.edu>
Message-ID: 

On Thu, Jan 14, 2010 at 7:25 PM, Gus Correa wrote:
> First, to test, run HPL on a single node or a few nodes,
> using small values of N, say 1000 to 20000.
>
> The maximum value of N can be approximated by
> Nmax = sqrt(0.8*Total_RAM_on_ALL_nodes_in_bytes/8).
> This uses all the RAM, but doesn't get into memory paging.
>
> Then run HPL on the whole cluster with the Nmax above.
> Nmax pushes the envelope, and is where your
> best performance (Rmax/Rpeak) is likely to be reached.
> Try several P/Q combinations for Nmax (see the TUNING file).

Thanks Gus! That helps a lot. I have Linpack running now on just a single server and am trying to tune it and hit the Rpeak.

I'm getting 62 Gflops, but I think my peak should be around 72 (2.26 GHz, 8-core Nehalem). On a single-server test do you manage to hit the theoretical peak? What's a good Rmax/Rpeak to shoot for while tuning?

Once I am confident I'm well tuned on one server, I'll try to extend it to the whole cluster.

--
Rahul

From hearnsj at googlemail.com Sun Jan 17 01:01:20 2010
From: hearnsj at googlemail.com (John Hearns)
Date: Sun, 17 Jan 2010 09:01:20 +0000
Subject: [Beowulf] Jobs in the UK
Message-ID: <9f8092cc1001170101m40e9f0ebm4b51d8c42927b743@mail.gmail.com>

If anyone is looking for a job in the UK, there are a few on offer:

http://www.jobserve.com/Systems-Specialist-Abingdon-Abingdon-Oxfordshire-Permanent-W6BA2DC924C566BD1.jsjob
(I think you can work out who this is.... if not I can tell you off list!)

http://www.jobserve.com/High-Performance-Computing-Grid-Services-Systems-Manager-Administrator-Kingston-upon-Thames-Surrey-Permanent-WE3A3B085349F371A.jsjob
(I have installed clusters in both places.)

There was an HPC job with a company in Oxfordshire, but it is no longer listed on Jobserve.
From robh at dongle.org.uk Sun Jan 17 03:24:48 2010
From: robh at dongle.org.uk (Rob Horton)
Date: Sun, 17 Jan 2010 11:24:48 +0000
Subject: [Beowulf] HPC/mpi courses
In-Reply-To: 
References: 
Message-ID: <20100117112448.GA1181@wyddfa.dongle.org.uk>

On Sat, Jan 16, 2010 at 11:50:48AM +0300, Walid wrote:
> Dear All,
>
> Do you know of any official courses run in Europe or Asia covering
> HPC systems or development? MPI or new distributed-memory paradigms
> are welcome.

NAG run various courses on behalf of HECToR in the UK:
http://www.hector.ac.uk/cse/training/
I'm not sure what the access arrangements are if your work isn't covered by one of the UK research councils.

Rob

From rpnabar at gmail.com Sun Jan 17 13:07:10 2010
From: rpnabar at gmail.com (Rahul Nabar)
Date: Sun, 17 Jan 2010 15:07:10 -0600
Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping
Message-ID: 

If I have the option of hardware RAID versus software RAID via mdadm, is there a clear winner in terms of performance? Or is the answer only resolvable by actual testing? I have a fairly fast machine (Nehalem 2.26 GHz, 8 cores) and 48 gigs of RAM.

Should I be using the vendor's hardware RAID or mdadm? In case a generic answer is not possible, what might be a good way to test the two options? Any other implications that I should be thinking about?

Finally, there's always a hybrid approach. I could have several small RAID5s at the hardware level (RAID5 seems OK since I have smaller disks, ~300 GB, so not really in the domain where the RAID6 arguments kick in, I think). Then using LVM I can integrate storage while asking LVM to stripe across these RAID5s. Thus I'd get striping at two levels: LVM (software) and RAID5 (hardware).

--
Rahul

From bill at cse.ucdavis.edu Sun Jan 17 14:51:27 2010
From: bill at cse.ucdavis.edu (Bill Broadley)
Date: Sun, 17 Jan 2010 14:51:27 -0800
Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping
In-Reply-To: 
References: 
Message-ID: <4B53946F.4080008@cse.ucdavis.edu>

Rahul Nabar wrote:
> If I have the option of hardware RAID versus software RAID via mdadm,
> is there a clear winner in terms of performance?

No.

> Or is the answer only resolvable by actual testing? I have a fairly fast
> machine (Nehalem 2.26 GHz, 8 cores) and 48 gigs of RAM.
>
> Should I be using the vendor's hardware RAID or mdadm?

Depends. Are you performance limited? If not, I'd go with the one that makes the admin happier. I prefer software RAID because I'm familiar with the interface, monitoring, and recovery. Not to mention that the flexibility of setting RAID per partition, and the ability to easily move RAIDs between machines, is valuable to me. With hardware RAID I'd want a spare controller around in case of failure.

Be warned that various hardware RAID companies seem to be making it harder to do software RAID. I've seen controllers do nasty things like not boot with 16 JBOD devices.

> In case a generic answer is not possible, what might be a good way to
> test the two options?

Ideally you would use your performance-limited application as a benchmark. Sure, there are plenty of micro-benchmarks to help: things like bonnie++, iozone, or postmark for a variety of random and sequential workloads. Postmark in particular can simulate a wide range of workloads. The ultimate benchmark is your particular usage.
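The micro-benchmarks Bill names can be driven along these lines; the target directory and data size here are assumptions, and the data size should be at least 2x RAM so the page cache doesn't flatter the numbers:

    # bonnie++: sequential throughput plus seek and file-creation tests
    bonnie++ -d /mnt/raid -s 96g -u nobody

    # iozone: automated sweep of read/write tests up to 96 GB files
    iozone -a -g 96G -f /mnt/raid/testfile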
> Any other implications that I should be thinking about?

Learning the GUI, bugs, and interface of a hardware RAID requires time and energy. I'd look at the failure modes you are trying to protect against. As far as reliability goes, I like smartctl, lm_sensors, RAID scrubbing, and the like to work so I can keep an eye on the disks' health.

> Finally, there's always a hybrid approach. I could have several small
> RAID5s at the hardware level (RAID5 seems OK since I have smaller
> disks, ~300 GB, so not really in the domain where the RAID6 arguments
> kick in, I think). Then using LVM I can integrate storage while asking
> LVM to stripe across these RAID5s. Thus I'd get striping at two
> levels: LVM (software) and RAID5 (hardware).

For a given level of reliability, size does affect the maximum number of disks in a RAID. Double-disk failures do happen. I'd hesitate to spread a file system across multiple RAIDs just because of recovery performance issues; after all, a file system spread across multiple RAIDs effectively reduces the entire group of disks to a single head.

For 16 disks I often use three 5-disk RAID5s and one global spare. If I run 3 workloads, one per 5 disks, I get dramatically better performance than 3 workloads running on 15 disks. So most issues (slow performance during backups, failures, disk full, migration, etc.) only affect 5 disks at a time, and could even be migrated to different machines.

From ljdursi at scinet.utoronto.ca Sun Jan 17 15:07:45 2010
From: ljdursi at scinet.utoronto.ca (Jonathan Dursi)
Date: Sun, 17 Jan 2010 18:07:45 -0500
Subject: [Beowulf] HPC/mpi courses
In-Reply-To: <20100117112448.GA1181@wyddfa.dongle.org.uk>
References: <20100117112448.GA1181@wyddfa.dongle.org.uk>
Message-ID: <0171F3F7-001B-4E43-B413-F3DE2A7F6054@scinet.utoronto.ca>

On 2010-01-17, at 6:24AM, Rob Horton wrote:
> On Sat, Jan 16, 2010 at 11:50:48AM +0300, Walid wrote:
>>
>> Do you know of any official courses run in Europe or Asia covering
>> HPC systems or development? MPI or new distributed-memory paradigms
>> are welcome.
>
> NAG run various courses on behalf of HECToR in the UK:
> http://www.hector.ac.uk/cse/training/

We have videos and slides up of a week-long MPI/OpenMP course we teach at SciNet at the University of Toronto:
http://www.cita.utoronto.ca/~ljdursi/PSP/
Videos online are no substitute for being in the classroom yourself, of course, but it's better than nothing.

Along those lines, does anyone have a good HPC / parallel computing textbook to get users started? There are (say) passable books on MPI, or OpenMP, or even on the Intel Thread Building Blocks stuff, but very little that I can find that integrates everything. Similarly with performance issues: O'Reilly used to have a pretty solid little book on HPC which was very nice for teaching people to think about serial optimization, but the last edition was 1998 and I can't find anything comparable.

- Jonathan

--
Jonathan Dursi

From a.travis at abdn.ac.uk Sun Jan 17 15:08:51 2010
From: a.travis at abdn.ac.uk (Tony Travis)
Date: Sun, 17 Jan 2010 17:08:51 -0600
Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping
In-Reply-To: 
References: 
Message-ID: <4B539883.8040003@abdn.ac.uk>

Rahul Nabar wrote:
> If I have the option of hardware RAID versus software RAID via mdadm,
> is there a clear winner in terms of performance? Or is the answer only
> resolvable by actual testing? I have a fairly fast machine (Nehalem
> 2.26 GHz, 8 cores) and 48 gigs of RAM.

Hello, Rahul.

It depends which level of RAID you want to use, and whether you want hot-swap capability. I use inexpensive 3ware 8006-2 RAID1 controllers and stripe them using "md" software RAID0 to make RAID10 arrays. This gives me good performance and hot-swap capability (the production md RAID driver does not support hot-swap).
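A sketch of Tony's layering, assuming the two 3ware RAID1 units appear to the kernel as /dev/sda and /dev/sdb (the device names, chunk size, and filesystem are assumptions):

    # Stripe two hardware RAID1 units into one md RAID0, i.e. a net RAID10
    mdadm --create /dev/md0 --level=0 --raid-devices=2 --chunk=256 \
          /dev/sda /dev/sdb

    mkfs.ext3 /dev/md0                         # or your filesystem of choice
    mdadm --detail --scan >> /etc/mdadm.conf   # persist the array definition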
However, where "md" really scores is portability. My RAID1s can only be read by 3ware controllers - I made a considered decision about this: the 3ware controllers are well supported by Linux kernels, but it makes me uneasy using a proprietary RAID format. I do also use "md" RAID5, which is more space-efficient, but read this:

http://www.baarf.com/

> Should I be using the vendor's hardware RAID or mdadm? In case a
> generic answer is not possible, what might be a good way to test the
> two options? Any other implications that I should be thinking about?

In fact, "mdadm" is just the user-space command for controlling the "md" driver. The problem with using an on-board RAID controller is that many of these are 'host' RAID (i.e. they need a Windows driver to do the RAID), in which case you are using the CPU anyway, and they also use proprietary formats. Generally, I just use SATA mode on the on-board RAID controller and create an "md" RAID. This means that I can replace a motherboard without worrying whether it has the same type of RAID controller on-board.

> Finally, there's always a hybrid approach. I could have several small
> RAID5s at the hardware level (RAID5 seems OK since I have smaller
> disks, ~300 GB, so not really in the domain where the RAID6 arguments
> kick in, I think). Then using LVM I can integrate storage while asking
> LVM to stripe across these RAID5s. Thus I'd get striping at two
> levels: LVM (software) and RAID5 (hardware).

Yes, I think a hybrid approach is good, because that's what I use ;-)

However, I would avoid relying on LVM mirroring for data protection. It is much safer to stripe a set of RAID1s using LVM. I don't think LVM is useful unless you are managing a disk farm. The commonest issue in disk performance is decoupling seeks between different spindles, so I put the system files on a different RAID1 set from the /export (or /home) filesystems.

HTH,
Tony.

--
Dr. A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition
and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK
tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk
mailto:a.travis at abdn.ac.uk, http://bioinformatics.rri.sari.ac.uk/~ajt

From gus at ldeo.columbia.edu Sun Jan 17 18:03:25 2010
From: gus at ldeo.columbia.edu (Gustavo Correa)
Date: Sun, 17 Jan 2010 21:03:25 -0500
Subject: [Beowulf] running the Linpack-HPL benchmark
In-Reply-To: 
References: <4B4FC3FD.6040302@ldeo.columbia.edu>
Message-ID: 

Hi Rahul

I've got Rmax/Rpeak around 84% on the cluster (AMD Opteron Shanghai, IB on a single switch). I didn't have the cluster available to play with HPL for too long, so not too much tuning; I had to move to production mode. Some folks on mailing lists said they'd get 90%, but the topmost groups in the Top500 get less (as of mid-2009 it was ~75%, IIRC), probably because of their big networks with stacked switches and communication overhead.

To optimize on a single node, apply the Nmax formula as well, using the node's RAM. P and Q (the block-matrix decomposition) tend to be optimal when they are close to each other.
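For reference, these knobs live in HPL.dat; the excerpt below shows only the relevant lines for a hypothetical single 8-core node with 48 GB of RAM (the rest of the stock file stays as shipped, and the values are assumptions to be tuned, not recommendations from the thread):

    71616        Ns     (N from the Nmax formula, rounded down to a multiple of NB)
    1            # of NBs
    192          NBs    (try a few values, e.g. 128/168/192/224)
    1            # of process grids (P x Q)
    2            Ps     (P <= Q, grid as close to square as possible)
    4            Qs     (P x Q = number of MPI ranks, here 8 cores)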
With Nehalem you may have to consider the extra complexity of symmetric multi-threading (hyperthreading), and whether or not it makes a difference on very regular problems like HPL, with big loops and not much branching/ifs. (Your real-world computational chemistry problems are probably not like that.) Have you tried HPL with and without SMT/hyperthreading? It may be worth testing on a single node at least.

I hope this helps.
Gus Correa

On Jan 16, 2010, at 10:02 PM, Rahul Nabar wrote:

> On Thu, Jan 14, 2010 at 7:25 PM, Gus Correa wrote:
>
>> First, to test, run HPL on a single node or a few nodes,
>> using small values of N, say 1000 to 20000.
>>
>> The maximum value of N can be approximated by
>> Nmax = sqrt(0.8*Total_RAM_on_ALL_nodes_in_bytes/8).
>> This uses all the RAM, but doesn't get into memory paging.
>>
>> Then run HPL on the whole cluster with the Nmax above.
>> Nmax pushes the envelope, and is where your
>> best performance (Rmax/Rpeak) is likely to be reached.
>> Try several P/Q combinations for Nmax (see the TUNING file).
>
> Thanks Gus! That helps a lot. I have Linpack running now on just a
> single server and am trying to tune it and hit the Rpeak.
>
> I'm getting 62 Gflops, but I think my peak should be around 72 (2.26
> GHz, 8-core Nehalem). On a single-server test do you manage to hit the
> theoretical peak? What's a good Rmax/Rpeak to shoot for while tuning?
>
> Once I am confident I'm well tuned on one server, I'll try to extend
> it to the whole cluster.

From alscheinine at tuffmail.us Sun Jan 17 19:13:28 2010
From: alscheinine at tuffmail.us (Alan Louis Scheinine)
Date: Sun, 17 Jan 2010 21:13:28 -0600
Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping
In-Reply-To: <4B539883.8040003@abdn.ac.uk>
References: <4B539883.8040003@abdn.ac.uk>
Message-ID: <4B53D1D8.1060909@tuffmail.us>

I had a nightmare problem with a newly compiled kernel not booting. The problem may have been with the mkblkdevs command of nash, but in any case I did an extensive web search on the AHCI controller that I had in both a notebook and a desktop computer, both of which had the booting problem.

One example of the controller is the Intel ICH9M-E/M SATA AHCI Controller. In web postings there were many problems attributed to this controller. It does RAID 1 using software, and the Linux driver was (is?) a "work in progress", from what I gather reading web postings.

Dr. A.J. Travis wrote:
> The problem with using an on-board RAID controller is that many of
> these are 'host' RAID (i.e. they need a Windows driver to do the RAID),
> in which case you are using the CPU anyway, and they also use
> proprietary formats.

My point is to underline this fact. Hardware RAID 1 should be simple and reliable, but when the RAID controller relies on software running on the OS, then it might be better to use Linux software RAID.

Alan

--
Alan Scheinine
200 Georgann Dr., Apt. E6
Vicksburg, MS 39180
Email: alscheinine at tuffmail.us
Mobile phone: 225 288 4176

From landman at scalableinformatics.com Sun Jan 17 19:36:43 2010
From: landman at scalableinformatics.com (Joe Landman)
Date: Sun, 17 Jan 2010 22:36:43 -0500
Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping
In-Reply-To: 
References: 
Message-ID: <4B53D74B.9000301@scalableinformatics.com>

Rahul Nabar wrote:
> If I have the option of hardware RAID versus software RAID via mdadm,
> is there a clear winner in terms of performance? Or is

Depends upon workload: writes vs reads, streaming vs random IO, number of simultaneous readers/writers. There is no real clear answer.

> the answer only resolvable by actual testing? I have a fairly fast
> machine (Nehalem 2.26 GHz, 8 cores) and 48 gigs of RAM.

Testing is a good thing. Sadly, too many people test *after* they've purchased something (only to discover what is meant by the term "marketing benchmark numbers").

> Should I be using the vendor's hardware RAID or mdadm? In case a

Ohhh ... it depends. Some of the "vendors'" hardware RAID ... heck ... most of it ... is rebadged LSI gear, usually their lower-end stuff, which is sometimes fake-raid. Use fake-raid only if no other options exist.
> Should I be using the vendor's hardware RAID or mdadm? In case a Ohhh ... it depends. Some of the "vendors" hardware raid ... heck ... most of it ... is rebadged LSI gear. Usually their lower end stuff which is sometimes fake-raid. Use fake-raid only if no other options exist. More in a moment. > generic answer is not possible, what might be a good way to test the > two options? Any other implications that I should be thinking about? Benchmark your load using a load generator like fio. > > Finally, there;s always hybrid approaches. I could have several small > RAID5's at the hardware level (RIAD5 seems ok since I have smaller > disks ~300 GB so not really in the domain where the RAID6 arguments > kick in, I think) Then using LVM I can integrate storage while asking RAID6 kicks in purely from the second correlated disk failure scenario. This is size independent. It happens, and you need to be prepared. > LVM to stripe across these RAID5's. Thus I'd get striping at two > levels: LVM (software) and RAID5 (hardware). LVM is not a performance tool. Use it to help you manage things, not speed things. Our own testing puts our 24 bay DV4 unit at a bit more than 1GB/s sustained read (large block sequential) in RAID6, with writes in the 400-500 MB/s region (large block sequential). This is MD RAID based. Our "equivalent" JR4 system clocks in at nearly double the read speed, and about 3+ x the write speed. This is a hardware RAID system. Your mileage will vary ... tremendously ... as a function of your IO pattern. My own suggestion is to test before you buy. After you buy, well, its a bit harder to change your mind. Joe -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From rpnabar at gmail.com Sun Jan 17 19:53:04 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Sun, 17 Jan 2010 21:53:04 -0600 Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping In-Reply-To: <4B53D74B.9000301@scalableinformatics.com> References: <4B53D74B.9000301@scalableinformatics.com> Message-ID: On Sun, Jan 17, 2010 at 9:36 PM, Joe Landman wrote: > > Ohhh ... it depends. ?Some of the "vendors" hardware raid ... heck ... most > of it ... is rebadged LSI gear. ?Usually their lower end stuff which is > sometimes fake-raid. ?Use fake-raid only if no other options exist. Thanks Joe! What's "fake RAID"? Just a bad implementation or...........? > LVM is not a performance tool. ?Use it to help you manage things, not speed > things. I had thought so. But why then does LVM have features like striping if not for performance? Or are they just not so good? -- Rahul From rpnabar at gmail.com Sun Jan 17 19:55:57 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Sun, 17 Jan 2010 21:55:57 -0600 Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping In-Reply-To: <4B539883.8040003@abdn.ac.uk> References: <4B539883.8040003@abdn.ac.uk> Message-ID: Thanks Tony for the helpful tips! On Sun, Jan 17, 2010 at 5:08 PM, Tony Travis wrote: > > However, I would avoid relying on LVM mirroring for data protection. It is > much safer to stripe a set of RAID1's using LVM. I don't think LVM is useful > unless you are managing a disk farm. The commonest issue in disk perfomance > is decoupling seeks between different spindles, so I put the system files on > a different RAID1-set to /export (or /home) filesystems. 
My problem is that I have several different "storage boxes" each running on RAID5. But I use LVM to aggregate this storage. While doing this I noticed that LVM offers striping too. That's what got me thinking....

--
Rahul

From landman at scalableinformatics.com Sun Jan 17 19:58:23 2010
From: landman at scalableinformatics.com (Joe Landman)
Date: Sun, 17 Jan 2010 22:58:23 -0500
Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping
In-Reply-To:
References: <4B53D74B.9000301@scalableinformatics.com>
Message-ID: <4B53DC5F.3040601@scalableinformatics.com>

Rahul Nabar wrote:
> On Sun, Jan 17, 2010 at 9:36 PM, Joe Landman
> wrote:
>> Ohhh ... it depends. Some of the "vendors" hardware raid ... heck ... most
>> of it ... is rebadged LSI gear. Usually their lower end stuff which is
>> sometimes fake-raid. Use fake-raid only if no other options exist.
>
> Thanks Joe! What's "fake RAID"? Just a bad implementation or...........?

Its a software RAID implementation pretending to be a hardware RAID implementation. They are rarely if ever as good as MD. Many of them in Linux will invoke dm (the "other" RAID engine) as dm has "support" for fake-raid. Note that we have lost data (multiple times) with dm+fake-raid in testing, so we don't recommend its use in important machines (ones which you can't afford to lose). This could be due to bad drivers for the chips in question, but we aren't taking chances.

>> LVM is not a performance tool. Use it to help you manage things, not speed
>> things.
>
> I had thought so. But why then does LVM have features like striping if
> not for performance? Or are they just not so good?

LVM doesn't perform as well as MD RAID for performance. You can use it, just be advised that you are leaving a great deal of performance on the table if you do so.

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
       http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

From rpnabar at gmail.com Sun Jan 17 20:01:50 2010
From: rpnabar at gmail.com (Rahul Nabar)
Date: Sun, 17 Jan 2010 22:01:50 -0600
Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping
In-Reply-To: <4B53DC5F.3040601@scalableinformatics.com>
References: <4B53D74B.9000301@scalableinformatics.com> <4B53DC5F.3040601@scalableinformatics.com>
Message-ID:

On Sun, Jan 17, 2010 at 9:58 PM, Joe Landman wrote:
> Its a software RAID implementation pretending to be a hardware RAID
> implementation. They are rarely if ever as good as MD. Many of them in
> Linux will invoke dm (the "other" RAID engine) as dm has "support" for
> fake-raid. Note that we have lost data (multiple times) with dm+fake-raid
> in testing, so we don't recommend its use in important machines (ones which
> you can't afford to lose). This could be due to bad drivers for the chips
> in question, but we aren't taking chances.

Ah! Thanks for the tip. I thought the line between hardware and software was clear cut. Any way to oust such an impostor RAID? What signs would it show? If the hardware in fact does use system software and CPU I can find out by looking for a specific daemon etc? Or maybe CPU loads?
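[A few generic checks that will usually unmask a fake-raid controller; a sketch, not a definitive procedure, since output and module names vary by distro and chipset:]

   lspci | grep -i raid       # fake-raid commonly shows up as a plain SATA/AHCI chip
   cat /proc/mdstat           # any md software RAID arrays
   dmraid -r                  # lists disks carrying BIOS ("fake") RAID metadata
   lsmod | grep -E 'ahci|sata_|dm_'   # generic SATA + dm modules rather than a vendor driver

There is usually no daemon to look for as such: a real hardware controller presents one logical disk and is managed through its own utility (e.g. MegaCli for LSI or tw_cli for 3ware), while fake-raid leaves the parity work on the host CPU.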
--
Rahul

From gerry.creager at tamu.edu Sun Jan 17 20:24:51 2010
From: gerry.creager at tamu.edu (Gerald Creager)
Date: Sun, 17 Jan 2010 22:24:51 -0600
Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping
In-Reply-To: <4B53D1D8.1060909@tuffmail.us>
References: <4B539883.8040003@abdn.ac.uk> <4B53D1D8.1060909@tuffmail.us>
Message-ID: <4B53E293.5000407@tamu.edu>

Hardware RAID, in my experience, works well with most LSI controllers (that haven't been modified to become Dell PERC controllers) and 3Ware controllers. I've had pretty grim results with most others. A colleague and I had great initial results with several ARECA controllers, but then they lost their minds and did strange things to our RAID'd volumes. Different Linux distros and no operator intervention at the times of failure, either.

Hardware RAID should use a real onboard controller and be real hardware RAID. A lot of 'em use a "driver" which relies on the OS to actually do the RAID but with some proprietary bits that I don't know and can't see when they break. In this case I'd rather use MD s/w RAID. That said, if I can use a current 3Ware or LSI card (note caveat above; I've not had good performance with any PERC's) I'd rather do hardware RAID for simplicity and recoverability.

But, you have to do your own due diligence and know your hardware.

gerry

Alan Louis Scheinine wrote:
> I had a nightmare problem with a newly compiled kernel not booting.
> The problem may have been with the command mkblkdevs of nash but in
> any case I did extensive web search on the AHCI controller that I had
> in both a notebook and a desktop computer, both of which had the
> booting problem.
>
> One example of the controller is
> Intel ICH9M-E/M SATA AHCI Controller
>
> In web postings there were many problems attributed to this controller.
> It does RAID 1 using software. The Linux driver was (is?) a "work in
> progress" from what I gather reading web postings.
>
> Dr. A.J. Travis wrote:
>> The problem with using an on-board RAID controller is that many of
>> these are 'host' RAID (i.e. need a Windows driver to do the RAID)
>> in which case you are using the CPU anyway, and they also use
>> proprietary formats.
>
> My point is to underline this fact. Hardware RAID 1 should be simple and
> reliable. But when the RAID controller relies on software running on the
> O/S, then it might be better to use Linux software RAID.
>
> Alan

--
Gerry Creager -- gerry.creager at tamu.edu
Texas Mesonet -- AATLT, Texas A&M University
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843

From rpnabar at gmail.com Sun Jan 17 20:46:21 2010
From: rpnabar at gmail.com (Rahul Nabar)
Date: Sun, 17 Jan 2010 22:46:21 -0600
Subject: [Beowulf] running the Linpak -HPL benchmark.
In-Reply-To:
References: <4B4FC3FD.6040302@ldeo.columbia.edu>
Message-ID:

On Sun, Jan 17, 2010 at 8:03 PM, Gustavo Correa wrote:
> I've got Rmax/Rpeak around 84% on the cluster (AMD Opteron Shanghai, IB on a single switch).
> I didn't have the cluster available to play with HPL for too long, not too much tuning,
> I had to move to production mode.
> Some folks on mailing lists said they'd get 90%, but the topmost group in Top500 get less
> (as of mid-2009 it was ~75%, IIRR), probably because of their big networks
> with stacked switches and communication overhead.

Thanks Gus! After further tuning I am at 95% on a single node. But performance is falling drastically when I go multinode.
Maybe because I only have a 1 GigE network right now.

--
Rahul

From gerry.creager at tamu.edu Sun Jan 17 20:54:52 2010
From: gerry.creager at tamu.edu (Gerald Creager)
Date: Sun, 17 Jan 2010 22:54:52 -0600
Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping
In-Reply-To: <4B53DC5F.3040601@scalableinformatics.com>
References: <4B53D74B.9000301@scalableinformatics.com> <4B53DC5F.3040601@scalableinformatics.com>
Message-ID: <4B53E99C.6080905@tamu.edu>

+1: Reality.

Joe Landman wrote:
> Rahul Nabar wrote:
>> On Sun, Jan 17, 2010 at 9:36 PM, Joe Landman
>> wrote:
>>> Ohhh ... it depends. Some of the "vendors" hardware raid ... heck
>>> ... most
>>> of it ... is rebadged LSI gear. Usually their lower end stuff which is
>>> sometimes fake-raid. Use fake-raid only if no other options exist.
>>
>> Thanks Joe! What's "fake RAID"? Just a bad implementation or...........?
>
> Its a software RAID implementation pretending to be a hardware RAID
> implementation. They are rarely if ever as good as MD. Many of them in
> Linux will invoke dm (the "other" RAID engine) as dm has "support" for
> fake-raid. Note that we have lost data (multiple times) with
> dm+fake-raid in testing, so we don't recommend its use in important
> machines (ones which you can't afford to lose). This could be due to
> bad drivers for the chips in question, but we aren't taking chances.
>
>>> LVM is not a performance tool. Use it to help you manage things, not
>>> speed
>>> things.
>>
>> I had thought so. But why then does LVM have features like striping if
>> not for performance? Or are they just not so good?
>
> LVM doesn't perform as well as MD RAID for performance. You can use it,
> just be advised that you are leaving a great deal of performance on the
> table if you do so.

--
Gerry Creager -- gerry.creager at tamu.edu
Texas Mesonet -- AATLT, Texas A&M University
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843

From forum.san at gmail.com Mon Jan 18 01:13:05 2010
From: forum.san at gmail.com (Sangamesh B)
Date: Mon, 18 Jan 2010 14:43:05 +0530
Subject: [Beowulf] Need some advise: Sun storage' management server hangs repeatedly
In-Reply-To: <4B4EA6DB.8070207@cs.earlham.edu>
References: <4B4EA6DB.8070207@cs.earlham.edu>
Message-ID:

Hello all,

Thanks for your suggestions. But we lost access to the cluster because of the delay. But I got useful information to debug next time.

Thanks,
Sangamesh

On Thu, Jan 14, 2010 at 10:38 AM, Skylar Thompson wrote:
> Sangamesh B wrote:
> > Hi HPC experts,
> >
> > I seek your advice/suggestions to resolve a storage (NAS) server's
> > repeated hanging problem.
> >
> > We have a 23-node Rocks-5.1 HPC cluster. The Sun storage of
> > capacity 12 TB is connected to a management server Sun Fire X4150
> > installed with RHEL 5.3 and this server is connected to a Gigabit
> > switch which provides the cluster private network. The home directories on
> > the cluster are NFS mounted from storage partitions across all nodes
> > including the master.
> >
> > This server hangs repeatedly. As an initial troubleshooting step
> > we installed Ganglia, to check network utilization. But it's normal.
> > We're not sure how to troubleshoot it and resolve the problem. Can
> > anybody help us resolve this issue?
> Is there anything amiss according to the service processor?
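[For anyone in a similar spot: if the service processor speaks IPMI (Sun's ILOM does), a couple of generic in-band queries can answer Skylar's question. A sketch only; exact invocations vary by platform:]

   ipmitool sel list    # hardware event log: ECC errors, overtemps, PSU faults, resets
   ipmitool sensor      # current readings and thresholds for all sensors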
> >
> --
> -- Skylar Thompson (skylar at cs.earlham.edu)
> -- http://www.cs.earlham.edu/~skylar/
>
>

From reuti at staff.uni-marburg.de Mon Jan 18 03:30:03 2010
From: reuti at staff.uni-marburg.de (Reuti)
Date: Mon, 18 Jan 2010 12:30:03 +0100
Subject: [Beowulf] Need some advise: Sun storage' management server hangs repeatedly
In-Reply-To:
References: <4B4EA6DB.8070207@cs.earlham.edu>
Message-ID: <05BCD392-37D0-422D-8BB3-8E6BCE1497ED@staff.uni-marburg.de>

Hi,

Am 18.01.2010 um 10:13 schrieb Sangamesh B:

> Hello all,
>
> Thanks for your suggestions.
> But we lost access to the cluster because of the delay.

but the access to the service processor should still be there, and I think Skylar referred to the ILOM interface.

-- Reuti

> But I got useful information to debug next time.
>
> Thanks,
> Sangamesh
> On Thu, Jan 14, 2010 at 10:38 AM, Skylar Thompson
> wrote:
> Sangamesh B wrote:
> > Hi HPC experts,
> >
> > I seek your advice/suggestions to resolve a storage (NAS) server's
> > repeated hanging problem.
> >
> > We have a 23-node Rocks-5.1 HPC cluster. The Sun storage of
> > capacity 12 TB is connected to a management server Sun Fire X4150
> > installed with RHEL 5.3 and this server is connected to a Gigabit
> > switch which provides the cluster private network. The home
> directories on
> > the cluster are NFS mounted from storage partitions across all nodes
> > including the master.
> >
> > This server hangs repeatedly. As an initial troubleshooting step
> > we installed Ganglia, to check network utilization. But it's normal.
> > We're not sure how to troubleshoot it and resolve the problem.
> Can
> > anybody help us resolve this issue?
> Is there anything amiss according to the service processor?
>
> --
> -- Skylar Thompson (skylar at cs.earlham.edu)
> -- http://www.cs.earlham.edu/~skylar/
>
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin
> Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf

From madskaddie at gmail.com Mon Jan 18 06:38:13 2010
From: madskaddie at gmail.com (madskaddie at gmail.com)
Date: Mon, 18 Jan 2010 14:38:13 +0000
Subject: [Beowulf] Gridengine and bash + Modules
In-Reply-To: <1263676497.14044.15.camel@voltaire.rc.usf.edu>
References: <1263676497.14044.15.camel@voltaire.rc.usf.edu>
Message-ID:

2010/1/16 Brian Smith :
> I'm using this in our environment. I've simply added the Modules
> environment code to /etc/bashrc and /etc/csh.cshrc on all nodes (I use
> puppet to manage everything, so this is easy). This ensures that
> Modules is properly integrated with your environment regardless of
> whether you are using an interactive or non-interactive invocation of
> these shells. This works for SGE (I'm on 6.2u4, ATM)
>

But it seems that gridengine spawns like "bash script_name" so no rc files are read. Reading bash manpage, I found the BASH_ENV environment variable:

"""
When bash is started non-interactively, to run a shell script, for example, it looks for the variable BASH_ENV in the environment, expands its value if it appears there, and uses the expanded value as the name of a file to read and execute. Bash behaves as if the following command were executed:
    if [ -n "$BASH_ENV" ]; then . "$BASH_ENV"; fi
but the value of the PATH variable is not used to search for the file name.
""" (bash manpage) Right now I'm setting this variable and with the "-V" job submission flag it's working well (it does not work correctly without it) Gil -- " It can't continue forever. The nature of exponentials is that you push them out and eventually disaster happens. " Gordon Moore (Intel co-founder and author of the Moore's law) From madskaddie at gmail.com Mon Jan 18 11:53:39 2010 From: madskaddie at gmail.com (madskaddie at gmail.com) Date: Mon, 18 Jan 2010 19:53:39 +0000 Subject: [Beowulf] Gridengine and bash + Modules In-Reply-To: <1263836258.10961.21.camel@voltaire.rc.usf.edu> References: <1263676497.14044.15.camel@voltaire.rc.usf.edu> <1263836258.10961.21.camel@voltaire.rc.usf.edu> Message-ID: 2010/1/18 Brian Smith : > Ah, the RedHat-isms that we take for granted... hah! ?I forgot that the > default ~/.bashrc I push out to everyone sources /etc/bashrc by default. > What distro are you using? > Debian lenny > There's also this bit of goodness from the man page: > > "Bash attempts to determine when it is being run with its standard input > connected ?to a a network connection, as if by the remote shell daemon, > usually rshd, or the secure shell daemon sshd. > The Debian bash man page doesn't say the word "sshd" (only "rshd"), and I'm using ssh as the remote shell, so it may be the case (weird, but possible). (...) > > I wonder if sge_shepherd doesn't, in fact, trick shells into behaving > this way... I know I'm not using BASH_ENV and my modules environment > works correctly. > Just to be sure we aren't missing something: you can load a module inside the submit job, correct? Case 1: - module load something - qsub job.sh - cat job.sh #!/bin/bash #(sge config stuff) mpirun ... #EOF Case 2 (what I pretend): - qsub job.sh - cat job.sh #!/bin/bash #(sge config stuff) module add something mpirun ... #EOF > > -Brian > > -- > Brian Smith > Senior Systems Administrator > IT Research Computing, University of South Florida > 4202 E. Fowler Ave. ENB308 > Office Phone: +1 813 974-1467 > Organization URL: http://rc.usf.edu > > > On Mon, 2010-01-18 at 14:38 +0000, madskaddie at gmail.com wrote: >> 2010/1/16 Brian Smith : >> > I'm using this in our environment. ?I've simply added the Modules >> > environment code to /etc/bashrc and /etc/csh.cshrc on all nodes (I use >> > puppet to manage everything, so this is easy). ?This ensures that >> > Modules is properly integrated with your environment regardless of >> > whether you are using an interactive or non-interactive invocation of >> > these shells. ?This works for SGE (I'm on 6.2u4, ATM) >> > >> >> But it seems that gridengine spawns like "bash script_name" so no rc >> files are read. Reading bash manpage, I found the BASH_ENV environment >> variable: >> >> """ >> When ?bash ?is ?started non-interactively, to run a shell script, for >> example, it looks for the variable BASH_ENV in the environment, >> expands its value if it appears there, and uses the expanded value as >> the name of a file to read and execute. ?Bash behaves as if the >> following command were executed: >> ? ? ? ? ? ? ? if [ -n "$BASH_ENV" ]; then . "$BASH_ENV"; fi >> but the value of the PATH variable is not used to search for the file name. >> """ >> (bash manpage) >> >> Right now I'm setting this variable and with the "-V" job submission >> flag it's working well (it does not work correctly without it) >> >> Gil >> >> > > -- " It can't continue forever. The nature of exponentials is that you push them out and eventually disaster happens. 
" Gordon Moore (Intel co-founder and author of the Moore's law) From reuti at staff.uni-marburg.de Mon Jan 18 13:03:32 2010 From: reuti at staff.uni-marburg.de (Reuti) Date: Mon, 18 Jan 2010 22:03:32 +0100 Subject: [Beowulf] Gridengine and bash + Modules In-Reply-To: References: <1263676497.14044.15.camel@voltaire.rc.usf.edu> <1263836258.10961.21.camel@voltaire.rc.usf.edu> Message-ID: <8D2C769A-A86A-4B7D-8787-59228B482070@staff.uni-marburg.de> Hi, Am 18.01.2010 um 20:53 schrieb madskaddie at gmail.com: > 2010/1/18 Brian Smith : >> Ah, the RedHat-isms that we take for granted... hah! I forgot >> that the >> default ~/.bashrc I push out to everyone sources /etc/bashrc by >> default. >> What distro are you using? >> > > Debian lenny > >> There's also this bit of goodness from the man page: >> >> "Bash attempts to determine when it is being run with its standard >> input >> connected to a a network connection, as if by the remote shell >> daemon, >> usually rshd, or the secure shell daemon sshd. >> > > The Debian bash man page doesn't say the word "sshd" (only "rshd"), > and I'm using ssh as the remote shell, so it may be the case (weird, > but possible). > > (...) a) you could define a starter method in SGE's queue setup to define the necessary things. But what functions are these in detail? I checked on one cluster and to me it looks like it's only one: module () { eval `/mypath/environment-modules/3.2.6//Modules/$MODULE_VERSION/ bin/modulecmd bash $*` } and the eval isn't necessary IMO. It would be necessary if $* would include something which has to be interpreted again. So an alias is also working for me (unset module and define it like below): b) alias module='/cm/local/apps/environment-modules/3.2.6//Modules/ $MODULE_VERSION/bin/modulecmd bash $*' or c) you could define a wrapper with a script and put it in /user/ local/bin or alike. -- Reuti >> >> I wonder if sge_shepherd doesn't, in fact, trick shells into beha >> ving >> this way... I know I'm not using BASH_ENV and my modules environment >> works correctly. >> > > Just to be sure we aren't missing something: you can load a module > inside the submit job, correct? > > Case 1: > > - module load something > - qsub job.sh > - cat job.sh > #!/bin/bash > #(sge config stuff) > > mpirun ... > > #EOF > > Case 2 (what I pretend): > - qsub job.sh > - cat job.sh > #!/bin/bash > #(sge config stuff) > > module add something > mpirun ... > > #EOF > > > > >> >> -Brian >> >> -- >> Brian Smith >> Senior Systems Administrator >> IT Research Computing, University of South Florida >> 4202 E. Fowler Ave. ENB308 >> Office Phone: +1 813 974-1467 >> Organization URL: http://rc.usf.edu >> >> >> On Mon, 2010-01-18 at 14:38 +0000, madskaddie at gmail.com wrote: >>> 2010/1/16 Brian Smith : >>>> I'm using this in our environment. I've simply added the Modules >>>> environment code to /etc/bashrc and /etc/csh.cshrc on all nodes >>>> (I use >>>> puppet to manage everything, so this is easy). This ensures that >>>> Modules is properly integrated with your environment regardless of >>>> whether you are using an interactive or non-interactive >>>> invocation of >>>> these shells. This works for SGE (I'm on 6.2u4, ATM) >>>> >>> >>> But it seems that gridengine spawns like "bash script_name" so no rc >>> files are read. 
Reading bash manpage, I found the BASH_ENV >>> environment >>> variable: >>> >>> """ >>> When bash is started non-interactively, to run a shell script, >>> for >>> example, it looks for the variable BASH_ENV in the environment, >>> expands its value if it appears there, and uses the expanded >>> value as >>> the name of a file to read and execute. Bash behaves as if the >>> following command were executed: >>> if [ -n "$BASH_ENV" ]; then . "$BASH_ENV"; fi >>> but the value of the PATH variable is not used to search for the >>> file name. >>> """ >>> (bash manpage) >>> >>> Right now I'm setting this variable and with the "-V" job submission >>> flag it's working well (it does not work correctly without it) >>> >>> Gil >>> >>> >> >> > > > > -- > " > It can't continue forever. The nature of exponentials is that you push > them out and eventually disaster happens. > " > Gordon Moore (Intel co-founder and author of the Moore's law) > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From rpnabar at gmail.com Mon Jan 18 23:26:06 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 19 Jan 2010 01:26:06 -0600 Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping In-Reply-To: <4B554350.5070506@gmail.com> References: <4B53D74B.9000301@scalableinformatics.com> <4B53DC5F.3040601@scalableinformatics.com> <4B554350.5070506@gmail.com> Message-ID: On Mon, Jan 18, 2010 at 11:29 PM, Richard Chang wrote: > > How do we differentiate between the software and hardware RAID > implementations. ANY visual difference?, are they identifiable?. That's a crucial question for me too. I am using the Dell PERC-6-e cards. Any easy ways of telling the FakeRAID's apart? -- Rahul From landman at scalableinformatics.com Tue Jan 19 06:21:10 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 19 Jan 2010 09:21:10 -0500 Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping In-Reply-To: <4B554350.5070506@gmail.com> References: <4B53D74B.9000301@scalableinformatics.com> <4B53DC5F.3040601@scalableinformatics.com> <4B554350.5070506@gmail.com> Message-ID: <4B55BFD6.2040404@scalableinformatics.com> Richard Chang wrote: > Joe Landman wrote: >> >> Its a software RAID implementation pretending to be a hardware RAID >> implementation. They are rarely if ever as good as MD. Many of them >> in Linux will invoke dm (the "other" RAID engine) as dm has "support" >> for fake-raid. Note that we have lost data (multiple times) with >> dm+fake-raid in testing, so we don't recommend its use in important >> machines (ones which you can't afford to lose). This could be due to >> bad drivers for the chips in question, but we aren't taking chances. >> >> > Hello Joe, > I would like to know specifically what models of LSI boxes are software > RAID implementation pretending to be a hardware RAID implementation. I Hi Richard: This I cannot tell you, as I don't have a comprehensive list of what uses what driver. I'd suggest looking at what drivers it loads for disks when it comes up. If dmraid comes up *and* enumerates devices, you have a strong probability that it is a fake-raid. This is not to say dmraid is bad. Again, its the underlying driver or chipset that we often run into problems with. > have a few LSI boxes where I work, and your post made me think if they > really are Hardware RAID implementation. 
> > I have the LSI 2822(old), LSI 4900 & also LSI 7900 controllers based > storage. > > How do we differentiate between the software and hardware RAID > implementations. ANY visual difference?, are they identifiable?. Rarely. Fake raid will generally not have any RAM cache or battery backup capability. In some instances, fake raid is *ok* for OS drives (RAID1 only), if the bios is smart enough to use it correctly, the underlying fake raid driver is relatively stable, and you have reasonable disks. Otherwise, mdadm works great, though you have to patch Redhat/Centos, as they, by default, use dmraid for the moment. Later model Fedora appear to have switched to MD raid (after 9 from what I saw, last time I played with it). > > Richard. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From landman at scalableinformatics.com Tue Jan 19 06:43:11 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 19 Jan 2010 09:43:11 -0500 Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping In-Reply-To: <4B55BFD6.2040404@scalableinformatics.com> References: <4B53D74B.9000301@scalableinformatics.com> <4B53DC5F.3040601@scalableinformatics.com> <4B554350.5070506@gmail.com> <4B55BFD6.2040404@scalableinformatics.com> Message-ID: <4B55C4FF.1050108@scalableinformatics.com> Joe Landman wrote: > This I cannot tell you, as I don't have a comprehensive list of what > uses what driver. I'd suggest looking at what drivers it loads for > disks when it comes up. If dmraid comes up *and* enumerates devices, > you have a strong probability that it is a fake-raid. This is not to > say dmraid is bad. Again, its the underlying driver or chipset that we > often run into problems with. I should also point out that the presence of dmraid and device enumeration is still not sufficient for determining whether something is or is not a fake-raid. To wit root at crunch:~# df -h /data Filesystem Size Used Avail Use% Mounted on /dev/mapper/2001b4d2306a71820-part1 4.6T 2.4T 2.2T 53% /data and /data is on a most assuredly a hardware accelerated RAID. There were some additional tools I installed with the distribution which also installed device-mapper. I could turn it off, but then I have some other bits to work around. It's easier to leave it on. FWIW: I am no fan of device mapper (the dm part of dmraid). It has caused us some serious grief in the past (serious grief == data lossage). Its not in a league with things like rieserfs, ext2, NTFS, and whatnot ... When device mapper works correctly (as above) it works fine. A tautology/Yogi-Berra-ism for sure. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From robh at dongle.org.uk Tue Jan 19 07:50:36 2010 From: robh at dongle.org.uk (Robert Horton) Date: Tue, 19 Jan 2010 15:50:36 +0000 Subject: [Beowulf] Filesystem benchmarks Message-ID: <1263916236.6962.72.camel@moelwyn.maths.qmul.ac.uk> Hi, I'm trying to run some benchmarks on a file server to test the effect of different filesystems, hardware vs software raid, putting the journal on a separate device, etc. 
The machine has 24G of RAM so I'm running /opt/iozone/bin/iozone -c -C -g 48g -n 48g -i 0 -i 1 -i 2 -q 1m -y 4k -a The trouble with this is that it's taking an absolute age - the first random read has been going for about 6 hrs so far. I'm therefore wondering about using either -e (to include a flush) or -I (to bypass the cache) with smaller file sizes. Will this still give a useful comparison? Presumably with -e the read data is still potentially read from cache? Any thoughts? Should I be trying a different approach altogether? Thanks, Rob From jlforrest at berkeley.edu Tue Jan 19 09:12:27 2010 From: jlforrest at berkeley.edu (Jon Forrest) Date: Tue, 19 Jan 2010 09:12:27 -0800 Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping In-Reply-To: <4B55BFD6.2040404@scalableinformatics.com> References: <4B53D74B.9000301@scalableinformatics.com> <4B53DC5F.3040601@scalableinformatics.com> <4B554350.5070506@gmail.com> <4B55BFD6.2040404@scalableinformatics.com> Message-ID: <4B55E7FB.7080305@berkeley.edu> On 1/19/2010 6:21 AM, Joe Landman wrote: > Rarely. Fake raid will generally not have any RAM cache or battery > backup capability. Not only that, but it won't have any hardware to do parity calculations. (It might be hard to recognize such hardware). > In some instances, fake raid is *ok* for OS drives (RAID1 only), if the > bios is smart enough to use it correctly, the underlying fake raid > driver is relatively stable, and you have reasonable disks. It doesn't take much extra work to do RAID0 or RAID1 so whether this is done by a fake raid driver or the md raid driver probably isn't significant from the resource usage point of view. The only advantage I can think of for fake raid is that there's usually a BIOS of sorts in the fake raid card that lets you manipulate the raid units. This might be more convenient than having to boot Linux and mess with mdadm commands. For RAID levels that require parity calculations, then having a hardware RAID card is a win because the card does a lot of work and hides both the parity calculations and required IOs from the host system. On the third hand, if you have a system with lots of CPU and I/O capacity that wouldn't otherwise get used, then it could be argued that a hardware RAID card is an unnecessary expense. In the old days it was easier to decide to go with hardware RAID. These days it's best to do test with both hardware and software RAID, and then see if the measured improvements of hardware RAID (if any) justify its expense. Of course, in any production system you'll want a few extra RAID cards lying around just in case. Cordially, -- Jon Forrest Research Computing Support College of Chemistry 173 Tan Hall University of California Berkeley Berkeley, CA 94720-1460 510-643-1032 jlforrest at berkeley.edu From a.travis at abdn.ac.uk Tue Jan 19 10:22:45 2010 From: a.travis at abdn.ac.uk (Tony Travis) Date: Tue, 19 Jan 2010 12:22:45 -0600 Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping In-Reply-To: <4B55E7FB.7080305@berkeley.edu> References: <4B53D74B.9000301@scalableinformatics.com> <4B53DC5F.3040601@scalableinformatics.com> <4B554350.5070506@gmail.com> <4B55BFD6.2040404@scalableinformatics.com> <4B55E7FB.7080305@berkeley.edu> Message-ID: <4B55F875.70501@abdn.ac.uk> Jon Forrest wrote: > [...] > The only advantage I can think of for fake raid is > that there's usually a BIOS of sorts in the fake > raid card that lets you manipulate the raid units. 
> This might be more convenient than having to boot > Linux and mess with mdadm commands. Hello, Jon and Joe. I use Adaptec 'host' RAID controllers as a way of adding SATA ports to old motherboards that don't have any on-board, configure them in SATA (i.e. non-RAID) mode and build "md" software RAID's using the ports. > For RAID levels that require parity calculations, then > having a hardware RAID card is a win because the card > does a lot of work and hides both the parity calculations > and required IOs from the host system. On the third hand, > if you have a system with lots of CPU and I/O capacity > that wouldn't otherwise get used, then it could be argued > that a hardware RAID card is an unnecessary expense. It has been argued before that, these days, "md" software RAID often performs better because the 'host' CPU is considerably more powerful than the embedded processor on a 'hardware' RAID controller. However, one point that is often overlooked, and the reason I chose a hybrid approach is that AFAIK "md" RAID's do not support hot-swap. I would be very interested to know if anyone is using hot-swap "md" RAID's in production servers: I do realise that development work is going on. > In the old days it was easier to decide to go with > hardware RAID. These days it's best to do test with > both hardware and software RAID, and then see if > the measured improvements of hardware RAID (if any) > justify its expense. Of course, in any production system > you'll want a few extra RAID cards lying around just > in case. Yes, I agree with that! A great virtue of "md" RAID's is that they are independant of the underlying disk controller, and you can easily replace broken controllers or motherboards. If you don't have a spare RAID controller supporting the proprietary format your shiny 'hardware' RAID is using then you can't access your data :-( Bye, Tony. -- Dr. A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk mailto:a.travis at abdn.ac.uk, http://bioinformatics.rri.sari.ac.uk/~ajt From landman at scalableinformatics.com Tue Jan 19 10:43:30 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 19 Jan 2010 13:43:30 -0500 Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping In-Reply-To: <4B55F875.70501@abdn.ac.uk> References: <4B53D74B.9000301@scalableinformatics.com> <4B53DC5F.3040601@scalableinformatics.com> <4B554350.5070506@gmail.com> <4B55BFD6.2040404@scalableinformatics.com> <4B55E7FB.7080305@berkeley.edu> <4B55F875.70501@abdn.ac.uk> Message-ID: <4B55FD52.206@scalableinformatics.com> Tony Travis wrote: > It has been argued before that, these days, "md" software RAID often > performs better because the 'host' CPU is considerably more powerful > than the embedded processor on a 'hardware' RAID controller. However, > one point that is often overlooked, and the reason I chose a hybrid > approach is that AFAIK "md" RAID's do not support hot-swap. I would be > very interested to know if anyone is using hot-swap "md" RAID's in > production servers: I do realise that development work is going on. Not entirely correct. SATA where the hot swap (bring device in/out) logic is. And it does (at least in modern kernels) support physical removal/addition of devices. The MD system itself is event driven. You can "automate" device removal/insertion into a unit, and rebuild the RAID as needed ... to a degree. 
The issue we run into is that occasionally, we have to force a bus scan on the scsi buses to see new SATA drives. Once that is done, some of our other tools automate the incorporation of the new disk within the RAID. > >> In the old days it was easier to decide to go with >> hardware RAID. These days it's best to do test with >> both hardware and software RAID, and then see if >> the measured improvements of hardware RAID (if any) >> justify its expense. Of course, in any production system >> you'll want a few extra RAID cards lying around just >> in case. > > Yes, I agree with that! > > A great virtue of "md" RAID's is that they are independant of the > underlying disk controller, and you can easily replace broken > controllers or motherboards. If you don't have a spare RAID controller > supporting the proprietary format your shiny 'hardware' RAID is using > then you can't access your data :-( In the many RAID cases we have dealt with over the years, we haven't run into this as an issue. That is, while touted as a real tangible benefit of MD RAID, it is of dubious real value in most of the cases we have encountered. Really the benefit is that of being against the change of business conditions for your RAID vendor. If you plan on keeping the same array active until it dies (4-10 years), this could be a consideration. However, you also have to worry about disk availability/compatibility, etc. That is, its not *just* a RAID card issue, its a full stack issue. MD allows you to reduce the risk in various portions of this stack. > > Bye, > > Tony. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From a.travis at abdn.ac.uk Tue Jan 19 13:30:22 2010 From: a.travis at abdn.ac.uk (Tony Travis) Date: Tue, 19 Jan 2010 15:30:22 -0600 Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping In-Reply-To: <4B55FD52.206@scalableinformatics.com> References: <4B53D74B.9000301@scalableinformatics.com> <4B53DC5F.3040601@scalableinformatics.com> <4B554350.5070506@gmail.com> <4B55BFD6.2040404@scalableinformatics.com> <4B55E7FB.7080305@berkeley.edu> <4B55F875.70501@abdn.ac.uk> <4B55FD52.206@scalableinformatics.com> Message-ID: <4B56246E.2050505@abdn.ac.uk> Joe Landman wrote: > [...] > Not entirely correct. SATA where the hot swap (bring device in/out) > logic is. And it does (at least in modern kernels) support physical > removal/addition of devices. The MD system itself is event driven. You > can "automate" device removal/insertion into a unit, and rebuild the > RAID as needed ... to a degree. The issue we run into is that > occasionally, we have to force a bus scan on the scsi buses to see new > SATA drives. Once that is done, some of our other tools automate the > incorporation of the new disk within the RAID. Hello, Joe. The "sdhci" driver in the 2.6 kernel does not notify the kernel of a device change, neither does it flush the kernel buffers. 
Hot-swapping drives using the standard SATA driver is a great way to corrupt your disks, all it does on a SATA disconnect is try connecting again under the assumption that the same drive is attached but the data rate is too high for the cable - I have practical experience of this problem ;-) I started off my quest to build a COTS RAID5 believing what you just said to be true, but I think there is a popular misconception about SATA: It's true that most modern SATA controllers do support hot-swap electrically, but SATA device drivers to my, albeit limited, knowledge do not notify the kernel that a device has been removed or added. The 3ware 'twe' 'hardware' RAID driver does, in response to events from the RAID controller firmware that is monitoring the physical drives. I've looked at the SATA driver sources quite carefully because I do want to use hot-swap with "md" if that is a *safe* and reliable thing to do. However, I am not confident that it is (yet!). Please correct me if I am wrong, because it would be very useful to be able to *reliably* hot-swap SATA drives on an "md" RAID. I bought a lot of 3ware 8006-2's because I don't trust "md" hot-swapping. The 8006-2 is well supported under Linux. >[...] > In the many RAID cases we have dealt with over the years, we haven't run > into this as an issue. That is, while touted as a real tangible benefit > of MD RAID, it is of dubious real value in most of the cases we have > encountered. I've dealt with quite a few cases myself, where we have upgraded motherboards (esp. Tyan) with completely different on-board RAID, with hit and miss support under Linux. Typically, I've replaced an old or faulty motherboard and left everything else as it was. It's because I was using "md" RAID's that this worked. Now I have a great big pile of 3ware 8006-2's just in case, but I also use the on-board RAID controllers in SATA/AHCI mode to construct "md" RAID's. I responded to Rahul who started this thread because his requirements seemed to be similar to mine: i.e. a small-scale DIY Beowulf cluster. In this context, every penny counts and we do not throw things away until they are actually dead: Old servers become new compute nodes, and so on. I think that lot of people reading this list are interested in running small Beowulf clusters for relatively small projects, like me. I've found the Beowulf list to be a mine of useful information, but we are not all running huge Beowulf clusters or supporting them commerically. > Really the benefit is that of being against the change of business > conditions for your RAID vendor. If you plan on keeping the same array > active until it dies (4-10 years), this could be a consideration. > However, you also have to worry about disk availability/compatibility, > etc. That is, its not *just* a RAID card issue, its a full stack issue. I agree, and I've been bitten by that for using 'enterprise' grade disks that are no longer available and ended up replacing faulty 250GB drives with 500GB drives just so I could rebuild the RAID after a disk failure. I've just repeated the trick replacing 500GB drives with 1TB. It's OK if the replacement drive is bigger, and you're using LBA so drive geometry doesn't matter. > MD allows you to reduce the risk in various portions of this stack. Indeed it does, but I think it would be better with reliable hot-swap! Bye, Tony. -- Dr. 
A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk mailto:a.travis at abdn.ac.uk, http://bioinformatics.rri.sari.ac.uk/~ajt From rpnabar at gmail.com Tue Jan 19 15:06:27 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 19 Jan 2010 17:06:27 -0600 Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping In-Reply-To: <4B56246E.2050505@abdn.ac.uk> References: <4B53D74B.9000301@scalableinformatics.com> <4B53DC5F.3040601@scalableinformatics.com> <4B554350.5070506@gmail.com> <4B55BFD6.2040404@scalableinformatics.com> <4B55E7FB.7080305@berkeley.edu> <4B55F875.70501@abdn.ac.uk> <4B55FD52.206@scalableinformatics.com> <4B56246E.2050505@abdn.ac.uk> Message-ID: On Tue, Jan 19, 2010 at 3:30 PM, Tony Travis wrote: > I responded to Rahul who started this thread because his requirements seemed > to be similar to mine: i.e. a small-scale DIY Beowulf cluster. In this > context, every penny counts and we do not throw things away until they are > actually dead: Old servers become new compute nodes, and so on. I think that > lot of people reading this list are interested in running small Beowulf > clusters for relatively small projects, like me. I've found the Beowulf list > to be a mine of useful information, but we are not all running huge Beowulf > clusters or supporting them commerically. I don't know about the others on the list, but you describe my situation pretty accurately Tony! :) Small budget, primitive hardware that's rarely retired etc. Sounds familiar. -- Rahul From jac67 at georgetown.edu Tue Jan 19 15:26:30 2010 From: jac67 at georgetown.edu (Jess Cannata) Date: Tue, 19 Jan 2010 18:26:30 -0500 Subject: [Beowulf] Parallel file systems In-Reply-To: References: Message-ID: <4B563FA6.4000204@georgetown.edu> On 01/13/2010 06:40 AM, tegner at renget.se wrote: > While starting to investigating different storage solutions I came across > gluster (www.gluster.com). I did a search on beowulf.org and came up with > nothing. gpfs, pvfs and lustre on the other resulted in lots of hits. > > Anyone with experience of gluster in HPC? > > Yes, we've been using Glusterfs on one of our lightly used Infiniband clusters (32-nodes, 256 cores). We have found it to be pretty easy to configure and we have liked its performance. If you want more information, you should e-mail Joe Landman, who is also on the list. He's used it in several large setups. > Regards, > > /jon > > From landman at scalableinformatics.com Tue Jan 19 15:51:16 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 19 Jan 2010 18:51:16 -0500 Subject: [Beowulf] Parallel file systems In-Reply-To: <4B563FA6.4000204@georgetown.edu> References: <4B563FA6.4000204@georgetown.edu> Message-ID: <4B564574.9010001@scalableinformatics.com> Jess Cannata wrote: > > > On 01/13/2010 06:40 AM, tegner at renget.se wrote: >> While starting to investigating different storage solutions I came across >> gluster (www.gluster.com). I did a search on beowulf.org and came up with >> nothing. gpfs, pvfs and lustre on the other resulted in lots of hits. >> >> Anyone with experience of gluster in HPC? >> >> > Yes, we've been using Glusterfs on one of our lightly used Infiniband > clusters (32-nodes, 256 cores). We have found it to be pretty easy to > configure and we have liked its performance. If you want more > information, you should e-mail Joe Landman, who is also on the list. 
> He's used it in several large setups. How did I not see this ... mea culpa Yes, we are using GlusterFS in multiple sites with multiple users. Getting excellent performance out of it, as long as the IB can keep up. Long story ask me over beer some day ... We are generating multiple quotes/RFP responses with it (one is going out literally right now). Bug me offline if you'd like. Joe >> Regards, >> >> /jon >> >> > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From brs at usf.edu Sat Jan 16 13:14:57 2010 From: brs at usf.edu (Brian Smith) Date: Sat, 16 Jan 2010 16:14:57 -0500 Subject: [Beowulf] Gridengine and bash + Modules In-Reply-To: References: Message-ID: <1263676497.14044.15.camel@voltaire.rc.usf.edu> I'm using this in our environment. I've simply added the Modules environment code to /etc/bashrc and /etc/csh.cshrc on all nodes (I use puppet to manage everything, so this is easy). This ensures that Modules is properly integrated with your environment regardless of whether you are using an interactive or non-interactive invocation of these shells. This works for SGE (I'm on 6.2u4, ATM) Users can then do the following: 1. Include 'module add' directives in their job scripts for application execution (preferred method) 2. Use persistent module add directives w/ 'module initadd' to ensure their jobs have the correct environment settings (good for interactive jobs via qrsh, but this is better solved in other ways). Here's what I added: ## for bash if [ -d "/opt/admin/Modules" ]; then MODULE_VERSION=3.2.6 MODULE_ROOT=/opt/admin/Modules/$MODULE_VERSION case "$0" in -sh|sh|*/sh) modules_shell=sh ;; -ksh|ksh|*/ksh) modules_shell=ksh ;; -zsh|zsh|*/zsh) modules_shell=zsh ;; -bash|bash|*/bash) modules_shell=bash ;; *) modules_shell=bash ;; esac MODULEPATH=$MODULE_ROOT/modulefiles:$HOME/.modulefiles export WORK SCRATCH MODULEPATH MODULE_ROOT MODULE_VERSION module() { eval `$MODULE_ROOT/bin/modulecmd $modules_shell $*`; } if [ -f $HOME/.modules ]; then eval `egrep '^module(.*load|.*add).*$' $HOME/.modules | head -1` fi fi ## For cshell if ( -d /opt/admin/Modules ) then setenv MODULESHOME /opt/admin/Modules/${MODULE_VERSION} if (! $?MODULEPATH ) then setenv MODULEPATH `sed 's/#.*$//' ${MODULESHOME}/init/.modulespath | awk 'NF==1{printf("%s:",$1)}'` endif if (! $?LOADEDMODULES ) then setenv LOADEDMODULES "" endif if ( -f $HOME/.modules ) then eval `egrep '^module(.*add|.*load).*$' $HOME/.modules | head -1` endif endif -- Brian Smith Senior Systems Administrator IT Research Computing, University of South Florida 4202 E. Fowler Ave. ENB308 Office Phone: +1 813 974-1467 Organization URL: http://rc.usf.edu On Sat, 2010-01-16 at 10:44 +0000, madskaddie at gmail.com wrote: > Greetings, > > I'm using gridengine (6.2u4, open source ver.) and I would like to use > the Modules software. Modules uses a shell function that must be > exported (bash: "export -f func_name" in order to set environment > variables), but gridengine has a bug related with bash exported > functions[1]. > > Is anybody using gridengine, bash and modules? How to solve this? 
> Changing shell is not an option ;) > > This issue is also being discussed here[2]. > > Thanks, > > Gil > > [1] - http://gridengine.sunsource.net/issues/show_bug.cgi?id=2173 > [2] - http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&viewType=browseAll&dsMessageId=238562#messagefocus > From brs at usf.edu Mon Jan 18 09:37:38 2010 From: brs at usf.edu (Brian Smith) Date: Mon, 18 Jan 2010 12:37:38 -0500 Subject: [Beowulf] Gridengine and bash + Modules In-Reply-To: References: <1263676497.14044.15.camel@voltaire.rc.usf.edu> Message-ID: <1263836258.10961.21.camel@voltaire.rc.usf.edu> Ah, the RedHat-isms that we take for granted... hah! I forgot that the default ~/.bashrc I push out to everyone sources /etc/bashrc by default. What distro are you using? There's also this bit of goodness from the man page: "Bash attempts to determine when it is being run with its standard input connected to a a network connection, as if by the remote shell daemon, usually rshd, or the secure shell daemon sshd. If bash determines it is being run in this fashion, it reads and executes commands from ~/.bashrc, if that file exists and is readable. It will not do this if invoked as sh. The --norc option may be used to inhibit this behavior, and the --rcfile option may be used to force another file to be read, but rshd does not generally invoke the shell with those options or allow them to be specified." I wonder if sge_shepherd doesn't, in fact, trick shells into behaving this way... I know I'm not using BASH_ENV and my modules environment works correctly. -Brian -- Brian Smith Senior Systems Administrator IT Research Computing, University of South Florida 4202 E. Fowler Ave. ENB308 Office Phone: +1 813 974-1467 Organization URL: http://rc.usf.edu On Mon, 2010-01-18 at 14:38 +0000, madskaddie at gmail.com wrote: > 2010/1/16 Brian Smith : > > I'm using this in our environment. I've simply added the Modules > > environment code to /etc/bashrc and /etc/csh.cshrc on all nodes (I use > > puppet to manage everything, so this is easy). This ensures that > > Modules is properly integrated with your environment regardless of > > whether you are using an interactive or non-interactive invocation of > > these shells. This works for SGE (I'm on 6.2u4, ATM) > > > > But it seems that gridengine spawns like "bash script_name" so no rc > files are read. Reading bash manpage, I found the BASH_ENV environment > variable: > > """ > When bash is started non-interactively, to run a shell script, for > example, it looks for the variable BASH_ENV in the environment, > expands its value if it appears there, and uses the expanded value as > the name of a file to read and execute. Bash behaves as if the > following command were executed: > if [ -n "$BASH_ENV" ]; then . "$BASH_ENV"; fi > but the value of the PATH variable is not used to search for the file name. > """ > (bash manpage) > > Right now I'm setting this variable and with the "-V" job submission > flag it's working well (it does not work correctly without it) > > Gil > > From dmitri.chubarov at gmail.com Mon Jan 18 01:26:00 2010 From: dmitri.chubarov at gmail.com (Dmitri Chubarov) Date: Mon, 18 Jan 2010 15:26:00 +0600 Subject: [Beowulf] HPC/mpi courses In-Reply-To: <0171F3F7-001B-4E43-B413-F3DE2A7F6054@scinet.utoronto.ca> References: <20100117112448.GA1181@wyddfa.dongle.org.uk> <0171F3F7-001B-4E43-B413-F3DE2A7F6054@scinet.utoronto.ca> Message-ID: Jonathan, thanks for the tip regarding the O'Reilly book. 
I did a search for it and found out that it has been made available for download by the author of the second edition, Dr. Charles Severance: http://cnx.org/content/col11136/latest/. The first edition, written by Kevin Dowd, was going out of print in 1996 when Charles Severance took charge of the second edition, and he put it online when the second edition ran out.

On software optimization for HPC we found the SIAM "cheetah" book "Performance Optimization of Numerically Intensive Codes" by Stefan Goedecker and Adolfy Hoisie to be often referred to as the standard reference.

The other two books I found on Safari were "Software Optimization for High-Performance Computing" by Kevin R. Wadleigh and Isom L. Crawford, published in 2000 with an emphasis on linear algebra and signal processing applications, and "The Art of Multiprocessor Programming" by Maurice Herlihy and Nir Shavit, published in 2008, which is really good on data structures and "non-numerical" algorithms.

There are probably many more books published by universities as online courses. Also I know a few undergraduate level textbooks in Russian that are unlikely to be ever translated into English.

Dima

On Mon, Jan 18, 2010 at 5:07 AM, Jonathan Dursi wrote:
>
> On 2010-01-17, at 6:24AM, Rob Horton wrote:
> > On Sat, Jan 16, 2010 at 11:50:48AM +0300, Walid wrote:
> >>
> >> do you know of any official courses run in Europe, or Asia covering
> >> HPC system, or development. mpi or new distributed memory paradigms
> >> are welcome.
> >
> > NAG run various courses on behalf of HECToR in the UK:
> > http://www.hector.ac.uk/cse/training/
>
> We have videos and slides up of a week-long MPI/OpenMP course we teach at SciNet at the University of Toronto:
>
>        http://www.cita.utoronto.ca/~ljdursi/PSP/
>
> Videos online are no substitute for being in the classroom yourself, of course, but it's better than nothing.
>
> Along those lines, does anyone have a good HPC / parallel computing textbook to get users started? There are (say) passable MPI books, or OpenMP, or even on the Intel thread building block stuff, but very little that integrates everything that I can find.
>
> Similarly with performance issues; O'Reilly used to have a pretty solid little book on HPC which was very nice for teaching people to think about serial optimization, but the last edition was 1998 and I can't find anything comparable.
>
>  - Jonathan
>
> --
> Jonathan Dursi
>
>
>
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From brs at usf.edu Mon Jan 18 13:15:10 2010
From: brs at usf.edu (Brian Smith)
Date: Mon, 18 Jan 2010 16:15:10 -0500
Subject: [Beowulf] Gridengine and bash + Modules
In-Reply-To:
References: <1263676497.14044.15.camel@voltaire.rc.usf.edu> <1263836258.10961.21.camel@voltaire.rc.usf.edu>
Message-ID: <1263849310.2531.3.camel@plato>

On Mon, 2010-01-18 at 19:53 +0000, madskaddie at gmail.com wrote:
> 2010/1/18 Brian Smith :
> > Ah, the RedHat-isms that we take for granted... hah! I forgot that the
> > default ~/.bashrc I push out to everyone sources /etc/bashrc by default.
> > What distro are you using?
> > > > Debian lenny > > > There's also this bit of goodness from the man page: > > > > "Bash attempts to determine when it is being run with its standard input > > connected to a a network connection, as if by the remote shell daemon, > > usually rshd, or the secure shell daemon sshd. > > > > The Debian bash man page doesn't say the word "sshd" (only "rshd"), > and I'm using ssh as the remote shell, so it may be the case (weird, > but possible). > > (...) > > > > > I wonder if sge_shepherd doesn't, in fact, trick shells into behaving > > this way... I know I'm not using BASH_ENV and my modules environment > > works correctly. > > > > Just to be sure we aren't missing something: you can load a module > inside the submit job, correct? > > Case 1: > > - module load something > - qsub job.sh > - cat job.sh > #!/bin/bash > #(sge config stuff) > > mpirun ... > > #EOF (Unless you are doing qsub -V) This should be - module initadd something - qsub job.sh ... Remember, you need a "module load null" line in ~/.bashrc or, in my case, ~/.modules. This makes sure the module is loaded when bash starts. > Case 2 (what I pretend): > - qsub job.sh > - cat job.sh > #!/bin/bash > #(sge config stuff) > > module add something > mpirun ... > > #EOF > > > > > > > > -Brian > > > > -- > > Brian Smith > > Senior Systems Administrator > > IT Research Computing, University of South Florida > > 4202 E. Fowler Ave. ENB308 > > Office Phone: +1 813 974-1467 > > Organization URL: http://rc.usf.edu > > > > > > On Mon, 2010-01-18 at 14:38 +0000, madskaddie at gmail.com wrote: > >> 2010/1/16 Brian Smith : > >> > I'm using this in our environment. I've simply added the Modules > >> > environment code to /etc/bashrc and /etc/csh.cshrc on all nodes (I use > >> > puppet to manage everything, so this is easy). This ensures that > >> > Modules is properly integrated with your environment regardless of > >> > whether you are using an interactive or non-interactive invocation of > >> > these shells. This works for SGE (I'm on 6.2u4, ATM) > >> > > >> > >> But it seems that gridengine spawns like "bash script_name" so no rc > >> files are read. Reading bash manpage, I found the BASH_ENV environment > >> variable: > >> > >> """ > >> When bash is started non-interactively, to run a shell script, for > >> example, it looks for the variable BASH_ENV in the environment, > >> expands its value if it appears there, and uses the expanded value as > >> the name of a file to read and execute. Bash behaves as if the > >> following command were executed: > >> if [ -n "$BASH_ENV" ]; then . "$BASH_ENV"; fi > >> but the value of the PATH variable is not used to search for the file name. > >> """ > >> (bash manpage) > >> > >> Right now I'm setting this variable and with the "-V" job submission > >> flag it's working well (it does not work correctly without it) > >> > >> Gil > >> > >> > > > > > > > From rchang.lists at gmail.com Mon Jan 18 21:29:52 2010 From: rchang.lists at gmail.com (Richard Chang) Date: Tue, 19 Jan 2010 10:59:52 +0530 Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping In-Reply-To: <4B53DC5F.3040601@scalableinformatics.com> References: <4B53D74B.9000301@scalableinformatics.com> <4B53DC5F.3040601@scalableinformatics.com> Message-ID: <4B554350.5070506@gmail.com> Joe Landman wrote: > > Its a software RAID implementation pretending to be a hardware RAID > implementation. They are rarely if ever as good as MD. Many of them > in Linux will invoke dm (the "other" RAID engine) as dm has "support" > for fake-raid. 
Note that we have lost data (multiple times) with > dm+fake-raid in testing, so we don't recommend its use in important > machines (ones which you can't afford to lose). This could be due to > bad drivers for the chips in question, but we aren't taking chances. > > Hello Joe, I would like to know specifically which models of LSI boxes are software RAID implementations pretending to be hardware RAID implementations. I have a few LSI boxes where I work, and your post made me wonder whether they really are hardware RAID implementations. I have the LSI 2822 (old), LSI 4900, and LSI 7900 controller-based storage. How do we differentiate between the software and hardware RAID implementations? Any visual difference? Are they identifiable? Richard. From rchang.lists at gmail.com Tue Jan 19 08:44:59 2010 From: rchang.lists at gmail.com (Richard Chang) Date: Tue, 19 Jan 2010 22:14:59 +0530 Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping In-Reply-To: <4B55BFD6.2040404@scalableinformatics.com> References: <4B53D74B.9000301@scalableinformatics.com> <4B53DC5F.3040601@scalableinformatics.com> <4B554350.5070506@gmail.com> <4B55BFD6.2040404@scalableinformatics.com> Message-ID: <4B55E18B.20108@gmail.com> Joe Landman wrote: > > Hi Richard: > > This I cannot tell you, as I don't have a comprehensive list of what > uses what driver. I'd suggest looking at what drivers it loads for > disks when it comes up. If dmraid comes up *and* enumerates devices, > you have a strong probability that it is a fake-raid. This is not to > say dmraid is bad. Again, it's the underlying driver or chipset that > we often run into problems with. Thanks Joe, This is a much better picture. I am sure there is no such thing as "dmraid" coming up when the system I maintain starts. > > > Rarely. Fake raid will generally not have any RAM cache or battery > backup capability. I am also sure all the storage controllers that I have mentioned have a battery and RAM cache. > In some instances, fake raid is *ok* for OS drives (RAID1 only), if > the bios is smart enough to use it correctly, the underlying fake raid > driver is relatively stable, and you have reasonable disks. > Otherwise, mdadm works great, though you have to patch Redhat/Centos, > as they, by default, use dmraid for the moment. Later model Fedora > appear to have switched to MD raid (after 9 from what I saw, last time > I played with it). I really appreciate your effort and the time taken to reply back. Thanks, Richard From sabujp at gmail.com Tue Jan 19 15:56:53 2010 From: sabujp at gmail.com (Sabuj Pattanayek) Date: Tue, 19 Jan 2010 17:56:53 -0600 Subject: [Beowulf] Parallel file systems In-Reply-To: <4B563FA6.4000204@georgetown.edu> References: <4B563FA6.4000204@georgetown.edu> Message-ID: Gluster is the easiest clustered FS to set up vs OCFS2, GFS/GFS2, Lustre, and XSan/Storenext. My gripe with it, though, is that quotas currently work at the filesystem level and not at the gluster level, and this gets messy if you've got a stripe and cluster+distributed setup across the same filesystems on multiple storage bricks. That is, if you fill up your quota on one node in the gluster mounted directory /dist/user (cluster+distributed) and then try to write to /stripe/user (the gluster stripe across all nodes), writes to the stripe will fail. Writes to the distributed directory continue until you run out of your quota on all the nodes.
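To make the two layouts concrete: both behaviours come from client-side translator stacks defined over the same bricks. This is a sketch from memory rather than an actual production volfile - the brick names and block size are made up - but a GlusterFS 2.x-style client config for the two mounts looks roughly like:

volume dist
  type cluster/distribute
  subvolumes brick1 brick2 brick3 brick4 brick5
end-volume

volume stripe
  type cluster/stripe
  option block-size 1MB
  subvolumes brick1 brick2 brick3 brick4 brick5
end-volume

Since both stacks sit on the same backend filesystems, a filesystem-level quota filled through one is filled for the other as well, which is exactly the failure mode described above.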
Ideally one should set up multiple filesystems across the nodes, each for the different types of read/write methods, but this isn't always possible or desirable, especially if you want one "global" filesystem space. Otherwise, performance is great using infiniband, especially to the stripes. Are there other clustered FS out there that use InfiniBand - maybe Lustre? There's also heavy development on the codebase so it's continually evolving and improving. HTH, Sabuj Pattanayek On Tue, Jan 19, 2010 at 5:26 PM, Jess Cannata wrote: > > On 01/13/2010 06:40 AM, tegner at renget.se wrote: >> While starting to investigate different storage solutions I came across >> gluster (www.gluster.com). I did a search on beowulf.org and came up with >> nothing. gpfs, pvfs and lustre on the other hand resulted in lots of hits. From kilian.cavalotti.work at gmail.com Wed Jan 20 23:18:54 2010 From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI) Date: Thu, 21 Jan 2010 08:18:54 +0100 Subject: [Beowulf] Parallel file systems In-Reply-To: References: <4B563FA6.4000204@georgetown.edu> Message-ID: On Wed, Jan 20, 2010 at 12:56 AM, Sabuj Pattanayek wrote: > Otherwise, performance is great using infiniband, especially to the > stripes. Are there other clustered FS out there that use InfiniBand - maybe > Lustre? Yup, plenty of them: Lustre, GPFS, pNFS... Cheers, -- Kilian From d.love at liverpool.ac.uk Thu Jan 21 08:51:45 2010 From: d.love at liverpool.ac.uk (Dave Love) Date: Thu, 21 Jan 2010 16:51:45 +0000 Subject: [Beowulf] Re: Gridengine and bash + Modules In-Reply-To: <1263849310.2531.3.camel@plato> (Brian Smith's message of "Mon, 18 Jan 2010 16:15:10 -0500") References: <1263676497.14044.15.camel@voltaire.rc.usf.edu> <1263836258.10961.21.camel@voltaire.rc.usf.edu> <1263849310.2531.3.camel@plato> Message-ID: <87eiljcsf2.fsf@liv.ac.uk> I saw this late after reuti referred to it on the SGE list where it was originally raised, and I'm not sure I've got the whole thread. Could someone explain what the problem actually is with SGE and modules, if there really is one? I think there isn't. As far as I (and reuti?) can tell, there's no general problem with `qsub -V' with loaded modules (which is probably natural for command-line submission and what I typically do), or with explicitly loading modules in a job script (which isn't affected by communication within SGE anyway). There is a general problem (and corresponding SGE bug report) with exported shell variables or function definitions with multi-line values, but modules doesn't export the function definition, and its body could be trivially re-written as a simple command (eliding `{...;}') if necessary. From hahn at mcmaster.ca Thu Jan 21 14:15:27 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 21 Jan 2010 17:15:27 -0500 (EST) Subject: [Beowulf] Parallel file systems In-Reply-To: References: <4B563FA6.4000204@georgetown.edu> Message-ID: > Otherwise, performance is great using infiniband, especially to the > stripes. I find that people have greatly differing expectations, so what does "great" mean to you? I would expect to achieve ~90% of the peak bandwidth of the interconnect, for instance, assuming enough and fast-enough storage targets. this is easy enough to achieve using Lustre in my experience (albeit I haven't really tried eg QDR IB.)
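A quick way to put numbers behind expectations like these is a crude single-client streaming test against the parallel mount, using O_DIRECT to keep the client page cache out of the measurement. The mount point below is hypothetical, and a real benchmark needs multiple clients and care with file sizes; this is just a sanity check:

# write, then read back, 16 GB on an assumed parallel-FS mount at /mnt/pfs
dd if=/dev/zero of=/mnt/pfs/bwtest bs=1M count=16384 oflag=direct
dd if=/mnt/pfs/bwtest of=/dev/null bs=1M iflag=direct

For DDR InfiniBand's ~16 Gbit/s data rate, 90% of peak works out to roughly 1.8 GB/s.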
From sabujp at gmail.com Thu Jan 21 14:51:58 2010 From: sabujp at gmail.com (Sabuj Pattanayek) Date: Thu, 21 Jan 2010 16:51:58 -0600 Subject: [Beowulf] Parallel file systems In-Reply-To: References: <4B563FA6.4000204@georgetown.edu> Message-ID: On Thu, Jan 21, 2010 at 4:15 PM, Mark Hahn wrote: >> Otherwise, performance is great using infiniband, especially to the >> stripes. > > I find that people have greatly differing expectations, so what does "great" > mean to you? The stripe is across 5 machines and I'm getting about 3 times speedup on writes and 2.5 times speedup on reads vs the same tests going directly to the XFS filesystem on a single node. Not hitting anything close to the peak interconnect bandwidth, about 4gbps on writes and 7gbps on reads. > > I would expect to achieve ~90% of the peak bandwidth of the interconnect, > for instance, assuming enough and fast-enough storage targets. this is easy > enough to achieve using Lustre in my experience (albeit I haven't really > tried eg QDR IB.) What about your setup above? What sort of speedup are you getting vs a single node, assuming your (storage) nodes are homogeneous in terms of hardware? From robl at mcs.anl.gov Fri Jan 22 07:47:00 2010 From: robl at mcs.anl.gov (Rob Latham) Date: Fri, 22 Jan 2010 09:47:00 -0600 Subject: [Beowulf] Parallel file systems In-Reply-To: References: <4B563FA6.4000204@georgetown.edu> Message-ID: <20100122154659.GA3743@mcs.anl.gov> On Tue, Jan 19, 2010 at 05:56:53PM -0600, Sabuj Pattanayek wrote: > Gluster is the easiest clustered FS to set up vs OCFS2, GFS/GFS2, > Lustre, and XSan/Storenext. I haven't set up any gluster systems, but a point of pride for the PVFS project is our ease of deployment. I'd be interested to hear ways that gluster is even easier to deploy than PVFS. > Otherwise, performance is great using infiniband, especially to the > stripes. Are there other clustered FS out there that use InfiniBand - maybe > Lustre? There's also heavy development on the codebase so it's > continually evolving and improving. PVFS has had infiniband support for quite some time. ==rob -- Rob Latham Mathematics and Computer Science Division Argonne National Lab, IL USA From christiansuhendra at gmail.com Thu Jan 21 23:25:48 2010 From: christiansuhendra at gmail.com (christian suhendra) Date: Fri, 22 Jan 2010 16:55:48 +0930 Subject: [Beowulf] mpi_cart_coords : invalid rank Message-ID: hello guys... may i ask your advice? i have a problem here: when i run my program on MPICH i get this error: root at cluster3:/mirror/mpich-1.2.7p1# mpirun -np 1 ./canon Process 0 of 1 on cluster3 Total Time: 4.161119 msecs root at cluster3:/mirror/mpich-1.2.7p1# mpirun -np 2 ./canon Process 1 of 2 on cluster3 Process 0 of 2 on cluster3 0 - MPI_CART_COORDS : Invalid rank [0] Aborting program ! [0] Aborting program! child process exited unexpectedly 0 aborted for information: my RSH and NFS are working.. please help me.. i really need your advice thank you very much regards, christian
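A likely cause, assuming ./canon implements Cannon's algorithm (the name suggests it, but the code isn't shown, so this is a guess): Cannon's algorithm needs a square process grid, so the cartesian communicator is typically created for sqrt(np) x sqrt(np) processes. With -np 2 the grid is 1x1, rank 1 has no coordinates in it, and MPI_Cart_coords aborts with an invalid-rank error. A quick test:

mpirun -np 4 ./canon
mpirun -np 9 ./canon

If perfect-square process counts run while -np 2 fails, the problem is the process count (or the program's dims calculation), not the rsh or NFS setup.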
From sabujp at gmail.com Fri Jan 22 09:27:41 2010 From: sabujp at gmail.com (Sabuj Pattanayek) Date: Fri, 22 Jan 2010 11:27:41 -0600 Subject: [Beowulf] Parallel file systems In-Reply-To: <20100122154659.GA3743@mcs.anl.gov> References: <4B563FA6.4000204@georgetown.edu> <20100122154659.GA3743@mcs.anl.gov> Message-ID: On Fri, Jan 22, 2010 at 9:47 AM, Rob Latham wrote: > On Tue, Jan 19, 2010 at 05:56:53PM -0600, Sabuj Pattanayek wrote: >> Gluster is the easiest clustered FS to set up vs OCFS2, GFS/GFS2, >> Lustre, and XSan/Storenext. > > I haven't set up any gluster systems, but a point of pride for the > PVFS project is our ease of deployment. I'd be interested to hear > ways that gluster is even easier to deploy than PVFS. Haven't set up PVFS, but after reading some articles I get the feeling that it can only write data using a striped method across IODs. Gluster gives you a bit more flexibility and robustness since there's also the distribute write method and mirroring options. From mdidomenico4 at gmail.com Fri Jan 22 10:33:08 2010 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Fri, 22 Jan 2010 13:33:08 -0500 Subject: [Beowulf] rhel page_size Message-ID: does anyone know if it's still possible to change the default page_size from 4k to something larger on RHEL v5 x86_64? My efforts to recompile the kernel with a larger page size are failing me, but i might be doing it wrong... thanks From chekh at pcbi.upenn.edu Fri Jan 22 10:56:28 2010 From: chekh at pcbi.upenn.edu (Alex Chekholko) Date: Fri, 22 Jan 2010 13:56:28 -0500 Subject: [Beowulf] rhel page_size In-Reply-To: References: Message-ID: <20100122135628.7a76e295.chekh@pcbi.upenn.edu> On Fri, 22 Jan 2010 13:33:08 -0500 Michael Di Domenico wrote: > does anyone know if it's still possible to change the default > page_size from 4k to something larger on RHEL v5 x86_64? > > My efforts to recompile the kernel with a larger page size are failing > me, but i might be doing it wrong... Hi Michael, What are you actually trying to accomplish?
> > These links may be relevant: > http://en.wikipedia.org/wiki/Page_(computer_memory)#Huge_pages > http://dank.qemfd.net/dankwiki/index.php/Pages > > One place where this comes up is trying to increase the blocksize of > ext3; my understanding is that it's not possible in practice: > http://en.wikipedia.org/wiki/Ext3#cite_note-7 > > Regards, > -- > Alex Chekholko ? chekh at pcbi.upenn.edu > From landman at scalableinformatics.com Fri Jan 22 11:49:13 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Fri, 22 Jan 2010 14:49:13 -0500 Subject: [Beowulf] rhel page_size In-Reply-To: References: <20100122135628.7a76e295.chekh@pcbi.upenn.edu> Message-ID: <4B5A0139.4050003@scalableinformatics.com> Michael Di Domenico wrote: > We'd like to run an experiment and see if MAGMA will run any > faster/better/different with a larger page size. Larger pages will reduce TLB thrashing/pressure. The best way to tell if you need it is to run performance counter tools during your program run. If you haven't, you might be optimizing something which provides minimal if any benefit. Have you profiled the code (using any of the various tools) to see where it is spending its time? -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From mdidomenico4 at gmail.com Fri Jan 22 12:14:36 2010 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Fri, 22 Jan 2010 15:14:36 -0500 Subject: [Beowulf] rhel page_size In-Reply-To: <4B5A0139.4050003@scalableinformatics.com> References: <20100122135628.7a76e295.chekh@pcbi.upenn.edu> <4B5A0139.4050003@scalableinformatics.com> Message-ID: Nope, haven't gotten that far yet. I seemed to recall (or atleast i thought i had) that i could easily change the page size during kernel compile. perhaps that was way back in the day or i'm just losing it. figured it'd be a quick easy test... guest not... :( On Fri, Jan 22, 2010 at 2:49 PM, Joe Landman wrote: > Michael Di Domenico wrote: >> >> We'd like to run an experiment and see if MAGMA will run any >> faster/better/different with a larger page size. > > Larger pages will reduce TLB thrashing/pressure. ?The best way to tell if > you need it is to run performance counter tools during your program run. ?If > you haven't, you might be optimizing something which provides minimal if any > benefit. > > Have you profiled the code (using any of the various tools) to see where it > is spending its time? > > > -- > Joseph Landman, Ph.D > Founder and CEO > Scalable Informatics Inc. > email: landman at scalableinformatics.com > web ?: http://scalableinformatics.com > ? ? ? http://scalableinformatics.com/jackrabbit > phone: +1 734 786 8423 x121 > fax ?: +1 866 888 3112 > cell : +1 734 612 4615 > From ebiederm at xmission.com Sat Jan 23 05:34:27 2010 From: ebiederm at xmission.com (Eric W. Biederman) Date: Sat, 23 Jan 2010 05:34:27 -0800 Subject: [Beowulf] rhel page_size In-Reply-To: (Michael Di Domenico's message of "Fri\, 22 Jan 2010 15\:14\:36 -0500") References: <20100122135628.7a76e295.chekh@pcbi.upenn.edu> <4B5A0139.4050003@scalableinformatics.com> Message-ID: Michael Di Domenico writes: > Nope, haven't gotten that far yet. I seemed to recall (or atleast i > thought i had) that i could easily change the page size during kernel > compile. perhaps that was way back in the day or i'm just losing it. 
> figured it'd be a quick easy test... guess not... :( Sounds like ia64, not x86_64. On x86_64 the hardware-supported page sizes are 4K, 2M, and 1G. Only 4K makes sense for general use. Eric From robl at mcs.anl.gov Mon Jan 25 07:22:20 2010 From: robl at mcs.anl.gov (Rob Latham) Date: Mon, 25 Jan 2010 09:22:20 -0600 Subject: [Beowulf] Parallel file systems In-Reply-To: References: <4B563FA6.4000204@georgetown.edu> <20100122154659.GA3743@mcs.anl.gov> Message-ID: <20100125152220.GA21173@mcs.anl.gov> On Fri, Jan 22, 2010 at 11:27:41AM -0600, Sabuj Pattanayek wrote: > Haven't set up PVFS, but after reading some articles I get the feeling > that it can only write data using a striped method across IODs. > Gluster gives you a bit more flexibility and robustness since there's > also the distribute write method and mirroring options. There are a couple options with regards to PVFS data distribution. "stripe across all" is one, but it's not difficult to change that to stripe across one or some. Software mirroring, it's true, remains a research effort. In practice, hardware redundancy serves many sites well. ==rob -- Rob Latham Mathematics and Computer Science Division Argonne National Lab, IL USA From eagles051387 at gmail.com Mon Jan 25 07:28:50 2010 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Mon, 25 Jan 2010 16:28:50 +0100 Subject: [Beowulf] clustering using xen virtualized machines Message-ID: has anyone tried clustering using xen based vm's. what is everyone's take on that? its something that popped into my head while in my lectures today. -- Jonathan Aquilina From h-bugge at online.no Mon Jan 25 07:30:28 2010 From: h-bugge at online.no (Håkon Bugge) Date: Mon, 25 Jan 2010 16:30:28 +0100 Subject: [Beowulf] rhel page_size In-Reply-To: References: <20100122135628.7a76e295.chekh@pcbi.upenn.edu> <4B5A0139.4050003@scalableinformatics.com> Message-ID: On Jan 22, 2010, at 21:14 , Michael Di Domenico wrote: > Nope, haven't gotten that far yet. I seemed to recall (or at least i > thought i had) that i could easily change the page size during kernel > compile. perhaps that was way back in the day or i'm just losing it. > figured it'd be a quick easy test... guess not... :( I used libhugetlbfs today with success, followed more or less this recipe: http://www.ibm.com/developerworks/systems/library/es-lop-leveragepages/ Huge mallocs could then take advantage of the huge page support on your CPU (2MB on x86_64) and you do not need any kernel/apps recompilation. Håkon From mathog at caltech.edu Mon Jan 25 10:46:31 2010 From: mathog at caltech.edu (David Mathog) Date: Mon, 25 Jan 2010 10:46:31 -0800 Subject: [Beowulf] Logging MCE information on next warm boot? Message-ID: Is it possible to have the Machine Check Exception (MCE) information saved to disk automatically on the next warm boot? Long form: A K7 node crashed yesterday and left an MCE on the screen which I copied down as: CPU 0 machine check exception 0000000000000007 Bank 1 F000000000000853 Bank 2 940040000000017A at 00000000001511C0 Kernel panic, not syncing, Unable to Continue Copying all of those numbers down is very error prone. As I understand it the MCE values stay in the registers of the CPU after the crash, and may be retrieved at the next warm boot (via a front panel reset, for instance).
But this save seems not to happen automatically, or at least I could not find anything that looked like an MCE dump in /var/log or /var/log/kernel when the system came up. So I want to set things up, if possible, to save this information to disk. For what it's worth, this is on a Tyan S2466, and while on the next warm boot the hardware monitor in the BIOS showed the CPU fan at full speed, when the OS came up lm_sensors showed it at half speed. I have seen this glitch before on other mysterious crashes, and the only way to clear it seems to be to unplug the unit for 10 minutes, allowing time for the errant bit to fade away. This is on a 2.6.24.17 kernel. Thanks, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From Greg at keller.net Mon Jan 25 12:26:35 2010 From: Greg at keller.net (Greg Keller) Date: Mon, 25 Jan 2010 14:26:35 -0600 Subject: [Beowulf] Logging MCE information on next warm boot? In-Reply-To: <201001252000.o0PK08ZT020207@bluewest.scyld.com> References: <201001252000.o0PK08ZT020207@bluewest.scyld.com> Message-ID: > Date: Mon, 25 Jan 2010 10:46:31 -0800 > From: "David Mathog" > Subject: [Beowulf] Logging MCE information on next warm boot? > To: beowulf at beowulf.org > Message-ID: > Content-Type: text/plain; charset=iso-8859-1 > > Is it possible to have the Machine Check Exception (MCE) information > saved to disk automatically on the next warm boot? David, I believe the utility you are looking for is mcelog. We usually run it with the following arguments: /usr/sbin/mcelog -h --ignorenodev --filter I think it clears the info after it reports it, so make sure to tee it to a file. I don't understand the command or the flags, just a copy / paste script kiddy in these regards, but I hope it helps. Cheers! Greg From deadline at eadline.org Mon Jan 25 15:23:59 2010 From: deadline at eadline.org (Douglas Eadline) Date: Mon, 25 Jan 2010 18:23:59 -0500 (EST) Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: References: Message-ID: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org> You may want to look at this: Building A Virtual Cluster with Xen http://www.clustermonkey.net//content/view/139/33/ -- Doug > has anyone tried clustering using xen based vm's. what is everyone's take > on > that? its something that popped into my head while in my lectures today. > > -- > Jonathan Aquilina > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Doug From deadline at eadline.org Mon Jan 25 15:30:15 2010 From: deadline at eadline.org (Douglas Eadline) Date: Mon, 25 Jan 2010 18:30:15 -0500 (EST) Subject: [Beowulf] WhisperingWulf: A Silent Personal Cluster Message-ID: <47005.192.168.1.1.1264462215.squirrel@mail.eadline.org> For those of you who hate fan noise (or have some free time, aluminum, and want to impress your colleagues), have a look at WhisperingWulf: A Silent Personal Cluster http://www.clustermonkey.net//content/view/273/1/ -- Doug From ashley at pittman.co.uk Mon Jan 25 16:12:28 2010 From: ashley at pittman.co.uk (Ashley Pittman) Date: Tue, 26 Jan 2010 00:12:28 +0000 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: References: Message-ID: On 25 Jan 2010, at 15:28, Jonathan Aquilina wrote: > has anyone tried clustering using xen based vm's. what is everyone's take on that?
its something that popped into my head while in my lectures today. I've been using Amazon ec2 for clustering for months now; from a software perspective it's very similar to running real hardware. For my needs (development) it's perfectly adequate, I've not benchmarked it against running the same code on the raw hardware though. Ashley, -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk From ebiederm at xmission.com Mon Jan 25 16:17:07 2010 From: ebiederm at xmission.com (Eric W. Biederman) Date: Mon, 25 Jan 2010 16:17:07 -0800 Subject: [Beowulf] Logging MCE information on next warm boot? In-Reply-To: (Greg Keller's message of "Mon\, 25 Jan 2010 14\:26\:35 -0600") References: <201001252000.o0PK08ZT020207@bluewest.scyld.com> Message-ID: Greg Keller writes: >> Date: Mon, 25 Jan 2010 10:46:31 -0800 >> From: "David Mathog" >> Subject: [Beowulf] Logging MCE information on next warm boot? >> To: beowulf at beowulf.org >> Message-ID: >> Content-Type: text/plain; charset=iso-8859-1 >> >> Is it possible to have the Machine Check Exception (MCE) information >> saved to disk automatically on the next warm boot? > > David, > > I believe the utility you are looking for is mcelog. We usually run it with > the following arguments: > /usr/sbin/mcelog -h --ignorenodev --filter > > I think it clears the info after it reports it, so make sure to tee it to a > file. I don't understand the command or the flags, just a copy / paste script > kiddy in these regards, but I hope it helps. In the case of a panic this won't work. You would need to set up kdump or something like that to capture the panic. This sounds like L1 or L2 cache corruption but I haven't ever had any machine checks on anything before the k8 core. Wow. You are talking about old machines. If machine check registers are kept across reboot there is a reasonable chance that the firmware clears them. Eric From lindahl at pbm.com Mon Jan 25 16:45:15 2010 From: lindahl at pbm.com (Greg Lindahl) Date: Mon, 25 Jan 2010 16:45:15 -0800 Subject: [Beowulf] Logging MCE information on next warm boot? In-Reply-To: References: Message-ID: <20100126004515.GA8936@bx9.net> On Mon, Jan 25, 2010 at 10:46:31AM -0800, David Mathog wrote: > Is it possible to have the Machine Check Exception (MCE) information > saved to disk automatically on the next warm boot? You can use a serial or serial-over-lan console to capture it. You can take a photo of the screen. Be sure to send the magic escapes that turn off screen-blanking: echo -e "\033[9;0]" >/dev/console echo -e "\033[13]" >/dev/console -- greg From eagles051387 at gmail.com Mon Jan 25 21:52:51 2010 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Tue, 26 Jan 2010 06:52:51 +0100 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: References: Message-ID: would something like this be useful for, let's say, setting up a system that works with AI and voice recognition? -- Jonathan Aquilina From henning.fehrmann at aei.mpg.de Mon Jan 25 23:58:40 2010 From: henning.fehrmann at aei.mpg.de (Henning Fehrmann) Date: Tue, 26 Jan 2010 08:58:40 +0100 Subject: [Beowulf] Logging MCE information on next warm boot?
In-Reply-To: References: Message-ID: <20100126075840.GA4952@gretchen.aei.mpg.de> Hi David, On Mon, Jan 25, 2010 at 10:46:31AM -0800, David Mathog wrote: > Is it possible to have the Machine Check Exception (MCE) information > saved to disk automatically on the next warm boot? > > Long form: > > A K7 node crashed yesterday and left an MCE on the screen which I copied > down as: > > CPU 0 machine check exception 0000000000000007 > Bank 1 F000000000000853 > Bank 2 940040000000017A at 00000000001511C0 > Kernel panic, not syncing, Unable to Continue > > Copying all of those numbers down is very error prone. As I understand > it the MCE values stay in the registers of the CPU after the crash, and > may be retrieved at the next warm boot (via a front panel reset, for > instance). But this save seems not to happen automatically, or at least > I could not find anything that looked like an MCE dump in /var/log or > /var/log/kernel when the system came up. So I want to set things up, if > possible to save this information to disk. We loaded the netconsole module. This works at least for the 2.6.27 kernel. AFAIK for older kernels one has to compile it into the kernel. It sends printk messages to a remote syslog-ng server which collects the information. I don't know how much netconsole sends in the case of a panic. netconsole needs parameters: modprobe netconsole netconsole=own_port at own_ip/NIC,remote_port at remote_IP/remote_mac Cheers, Henning From eagles051387 at gmail.com Tue Jan 26 04:00:34 2010 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Tue, 26 Jan 2010 12:00:34 +0000 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org> References: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org> Message-ID: does anyone have any benchmarks for I/O in a virtualized cluster? From tjrc at sanger.ac.uk Tue Jan 26 05:24:33 2010 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Tue, 26 Jan 2010 13:24:33 +0000 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: References: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org> Message-ID: On 26 Jan 2010, at 12:00 pm, Jonathan Aquilina wrote: > does anyone have any benchmarks for I/O in a virtualized cluster? I don't have formal benchmarks, but I can tell you what I see on my VMware virtual machines in general: Network I/O is reasonably fast - there's some additional latency, but nothing particularly severe. VMware can special-case communication between VMs on the same physical host, if required, but that reduces flexibility in moving the VMs around. Disk I/O is fairly poor, especially once the number of virtual machines becomes large. This is hardly surprising - the VMs are contending for shared resources, and there's bound to be more contention in a virtualised setup than in physical machines. In our case we have ~170 virtual machines running on 9 physical servers, each of which has dual GigE for VM traffic and dual-port fibrechannel. Forgive me for using VMware parlance rather than Xen, but hopefully the ideas will be the same. Here are a few things I've noted: 1) Applications with I/O patterns of large numbers of small disk operations are particularly painful (such as our ganglia server with all its thousands of tiny updates to RRD files).
We've mitigated this by configuring Linux on this guest to allow a much larger proportion of dirty pages than usual, and to not flush to disk quite so often. OK, so I risk losing more data if the VM goes pop, but this is just ganglia graphing, so I don't really care too much in that particular case. 2) Raw device maps (where you pass a LUN straight through to a single virtual machine, rather than carving the disk out of a datastore) reduce contention and increase performance somewhat, at the cost of using up device minor numbers on ESX quite quickly; because ESX is basically Linux, you're limited to 256 (I think - it might be 128) LUNs presented to each host, and probably to each cluster, since VMs need to be able to migrate. I basically use RDMs for database applications where the storage requirements are greater than about 500 GB. For less than that I use datastores. 3) Keep the number of virtual machines per datastore quite low, especially if the applications are I/O heavy, to reduce contention. 4) In an ideal world I'd spread the datastores over a larger number of RAID units than I currently have, but my budget can't stand that. All this is rather dependent of course on what technology you're using to provide storage to your virtual machines. We're using fibrechannel, but of course mileage may vary considerably if you use NAS or iSCSI, and depending on how many NICs you're bonding together to get bandwidth. From tjrc at sanger.ac.uk Tue Jan 26 06:24:20 2010 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Tue, 26 Jan 2010 14:24:20 +0000 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: References: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org> Message-ID: On 26 Jan 2010, at 1:24 pm, Tim Cutts wrote: > 2) Raw device maps (where you pass a LUN straight through to a > single virtual machine, rather than carving the disk out of a > datastore) reduce contention and increase performance somewhat, at > the cost of using up device minor numbers on ESX quite quickly; > because ESX is basically Linux, you're limited to 256 (I think - it > might be 128) LUNs presented to each host, and probably to each > cluster, since VMs need to be able to migrate. I basically use RDMs > for database applications where the storage requirements are greater > than about 500 GB. For less than that I use datastores. It's been pointed out to me that of course Linux supports a lot more than 256 devices presented. Nevertheless, for some reason ESX does not - presumably it's only using a single major number or something. Tim
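The dirty-page tuning Tim describes for the ganglia guest comes down to a handful of VM sysctls. The values below are illustrative rather than his actual settings; the trade-off, as he says, is losing more unflushed data if the guest dies:

# let more dirty pages accumulate before writeback kicks in
sysctl -w vm.dirty_background_ratio=20
sysctl -w vm.dirty_ratio=60
# let dirty data sit for 60s (default 30s) before it must be flushed
sysctl -w vm.dirty_expire_centisecs=6000
sysctl -w vm.dirty_writeback_centisecs=3000

To make them permanent, put the equivalent entries in /etc/sysctl.conf.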
From eagles051387 at gmail.com Tue Jan 26 06:38:11 2010 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Tue, 26 Jan 2010 15:38:11 +0100 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: References: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org> Message-ID: for starters to save on resources why not cut out the gui and go commandline to free up some more of the shared resources, and 2ndly wouldn't offloading data storage to a san or nfs storage server mitigate the disk I/O issues? i honestly don't know much about xen as i just got my hands dirty with it. wouldn't it be better than using software virtualization since xen takes advantage of the hardware virtualization that most modern processors come with? From lynesh at Cardiff.ac.uk Tue Jan 26 06:48:48 2010 From: lynesh at Cardiff.ac.uk (Huw Lynes) Date: Tue, 26 Jan 2010 14:48:48 +0000 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: References: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org> Message-ID: <1264517328.2282.47.camel@w609.insrv.cf.ac.uk> On Tue, 2010-01-26 at 13:24 +0000, Tim Cutts wrote: > 1) Applications with I/O patterns of large numbers of small disk > operations are particularly painful (such as our ganglia server with > all its thousands of tiny updates to RRD files). We've mitigated this > by configuring Linux on this guest to allow a much larger proportion > of dirty pages than usual, and to not flush to disk quite so often. > OK, so I risk losing more data if the VM goes pop, but this is just > ganglia graphing, so I don't really care too much in that particular > case. Ganglia thrashes disks even on physical hardware. So I'm not sure it's fair to lay this at the door of VMware. We run our ganglia on physical hardware and we still have to put the RRDs in a tmpfs partition to stop the disk I/O grinding the server down. Other than that your experience matches what I've seen with my ESX system (which we don't use for HPC). Thanks, Huw -- Huw Lynes | Advanced Research Computing HEC Sysadmin | Cardiff University | Redwood Building, Tel: +44 (0) 29208 70626 | King Edward VII Avenue, CF10 3NB From john.hearns at mclaren.com Tue Jan 26 07:24:19 2010 From: john.hearns at mclaren.com (Hearns, John) Date: Tue, 26 Jan 2010 15:24:19 -0000 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: References: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org> Message-ID: <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com> for starters to save on resources why not cut out the gui and go commandline to free up some more of the shared resources, and 2ndly wouldn't offloading data storage to a san or nfs storage server mitigate the disk I/O issues? i honestly don't know much about xen as i just got my hands dirty with it. wouldn't it be better than using software virtualization since xen takes advantage of the hardware virtualization that most modern processors come with? Jonathan, in a private reply I've already said that you should not be put off from having bright ideas! In no way wishing to rain on your parade - and indeed wishing you to experiment and keep asking questions, which you are very welcome to do, this has been thought of. Cluster nodes are commonly run without any GUI - commandline only, as you say. The debate comes around on this list every so often about running diskless!
The answer is yes, you can run diskless compute nodes, and I do. You boot them over the network, and have an NFS-root filesystem. On many clusters the application software is NFS mounted also. Your point about a SAN is very relevant - I would say that direct, physical fibrechannel SAN connections in a cluster are not common - simply due to the expense of installing the cards and a separate infrastructure. However, iSCSI is used and Infiniband is common in clusters. Apologies - I really don't want to come across as knowing better than you (which I don't). If we don't have people asking "what if" and "hey - here's a good idea" then you won't make anything new. From eagles051387 at gmail.com Tue Jan 26 08:34:21 2010 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Tue, 26 Jan 2010 17:34:21 +0100 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com> References: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org> <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com> Message-ID: john i thank you for the encouragement. it's better than what i get from certain people i deal with in ubuntu channels. you mention diskless booting using tftp and pxe. the problem though arises when you have a certain number of nodes accessing the same disk simultaneously, where disk I/O shoots through the roof. the only reason i know this is that i was helping a professor during the 2 yrs i was in college in the usa before transferring; we were creating a small cluster, 1 head node and 4 slaves, all diskless. it's a nice thing to have, but one thing that puts me off it is the idea of bogging down the drive. in the case of diskless, would it be better, then, to go to an SSD on the head node? From john.hearns at mclaren.com Tue Jan 26 08:37:48 2010 From: john.hearns at mclaren.com (Hearns, John) Date: Tue, 26 Jan 2010 16:37:48 -0000 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: <4B5F18BB.8000406@sas.upenn.edu> References: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org> <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com> <4B5F18BB.8000406@sas.upenn.edu> Message-ID: <68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com> > > Is it just me, or does HPC clustering and virtualization fall on > opposite ends of the spectrum? > Gavin, not necessarily. You could have a cluster of HPC compute nodes running a minimal base OS. Then install specific virtual machines with different OS/software stacks each time you run a job. OK, this is probably more relevant for grid or cloud computing - I first thought this would be a good idea when seeing that (at the time) the CERN LHC Grid software would only run with Redhat 7.2. So you could imagine 'packaging up' a virtual machine which has your particular OS flavour/libraries/compilers and shipping it out with the job. Another reason could be fault tolerance - you run VMs on the compute nodes.
When you detect that a hardware fault is coming along (eg from ECC errors or disk errors) you perform a live migration from one node to another - and your job keeps on trucking. (In theory; checkpointing needed etc. etc.) From mathog at caltech.edu Tue Jan 26 08:38:57 2010 From: mathog at caltech.edu (David Mathog) Date: Tue, 26 Jan 2010 08:38:57 -0800 Subject: [Beowulf] Logging MCE information on next warm boot? Message-ID: Henning Fehrmann wrote: > We loaded the netconsole module. This works at least for the > 2.6.27 kernel. AFAIK for older kernels one has to compile it into the kernel. Ah, good idea, and this distro already has that, but it isn't enabled by default. I see how to configure it and turn it on. Will a logger message for "kern" test it, or is there some other way to force a printk? I'm afraid the logger method might look like it is working, but just go through the usual syslog channels instead of netconsole. Thanks, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From mathog at caltech.edu Tue Jan 26 10:46:40 2010 From: mathog at caltech.edu (David Mathog) Date: Tue, 26 Jan 2010 10:46:40 -0800 Subject: [Beowulf] Logging MCE information on next warm boot? Message-ID: > David Mathog wrote: > Will a logger > message for "kern" test it, or is there some other way to force a > printk? I'm afraid the logger method might look like it is working, but > just go through the usual syslog channels instead of netconsole. Too optimistic. With netconsole (supposedly) running on the local node logger -p kern.err "test from me" only shows up in the log file on that node. No chance of confusion ;-). There is no explicit network logging of kern.err in /etc/syslog.conf, since I figured syslog is never going to be able to actually log anything after a kernel error. dmesg shows that netconsole started and thinks it is working: netconsole: local port 6666 netconsole: local IP 192.168.1.20 netconsole: interface eth0 netconsole: remote port 514 netconsole: remote IP 192.168.1.220 netconsole: remote ethernet address 00:30:48:59:f8:ff console [netcon0] enabled netconsole: network logging started However, absolutely nothing comes over netconsole when a node reboots. Searched a lot and finally found out how to test netconsole: [root at monkey20 rc6.d]# echo 'p' > /proc/sysrq-trigger [root at monkey20 rc6.d]# echo 't' > /proc/sysrq-trigger [root at monkey20 rc6.d]# echo 'm' > /proc/sysrq-trigger and it generated these on the syslogd machine: Jan 26 10:21:12 monkey20.cluster SysRq : Jan 26 10:21:12 monkey20.cluster Show Regs Jan 26 10:21:35 monkey20.cluster SysRq : Jan 26 10:21:35 monkey20.cluster Show State Jan 26 10:21:52 monkey20.cluster SysRq : Jan 26 10:21:52 monkey20.cluster Show Memory Notice the contentless messages, which were the same as on the video console. This is a log level issue; change it with dmesg, or [root at monkey20 rc6.d]# echo '9' > /proc/sysrq-trigger [root at monkey20 rc6.d]# echo 'm' > /proc/sysrq-trigger and then a pile of memory information shows up on both the syslog side and the video console. The default log level on these machines is 3. If the kernel panics with it set to that, will the messages that result be "contentless", like the ones above?
Thanks, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From vanallsburg at hope.edu Tue Jan 26 11:37:10 2010 From: vanallsburg at hope.edu (Paul Van Allsburg) Date: Tue, 26 Jan 2010 14:37:10 -0500 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: References: Message-ID: <4B5F4466.9020403@hope.edu> Ashley Pittman wrote: > On 25 Jan 2010, at 15:28, Jonathan Aquilina wrote: > > >> has anyone tried clustering using xen based vm's. what is everyone's take on that? its something that popped into my head while in my lectures today. >> > > I've been using Amazon ec2 for clustering for months now, from a software perspective it's very similar to running real hardware. For my needs (development) it's perfectly adequate, I've not benchmarked it against running the same code on the raw hardware though. > > Ashley, > > Hi Ashley, I'd love to try clustering on Amazon. Is there a good writeup somewhere on how to configure & use mpi in the cloud? Thanks! Paul -- Paul Van Allsburg Scientific Computing Specialist Natural Sciences Division, Hope College 35 E. 12th St. Holland, Michigan 49423 616-395-7292 vanallsburg at hope.edu http://www.hope.edu/academic/csm/ From eagles051387 at gmail.com Tue Jan 26 12:28:53 2010 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Tue, 26 Jan 2010 21:28:53 +0100 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: <4B5F4466.9020403@hope.edu> References: <4B5F4466.9020403@hope.edu> Message-ID: do you guys think that virtualized clustering is the future? From hahn at mcmaster.ca Tue Jan 26 15:18:40 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Tue, 26 Jan 2010 18:18:40 -0500 (EST) Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com> References: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org> <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com> <4B5F18BB.8000406@sas.upenn.edu> <68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com> Message-ID: >> Is it just me, or does HPC clustering and virtualization fall on >> opposite ends of the spectrum? depends on your definitions. virtualization certainly conflicts with those aspects of HPC which require bare-metal performance. even if you can reduce the overhead of virtualization, the question is why? look at the basic sort of HPC environment: compute nodes running a single distro, controlled by a scheduler. from the user's or job's perspective, there are just some nodes - which ones doesn't matter, or even how many in total. the user _should_ be able to assume that when they land on a node, it behaves as if freshly installed and booted de novo. we don't reboot nodes between jobs, of course, or even make much effort towards preventing a serial job from noticing other serial jobs on the same node (as containers would, let alone VMs). but we could, without tons of effort - it would just lower utilization. virtualization is about a few things: - improve utilization by coalescing low-duty-cycle services. - isolate services from each other - either to directly arbitrate runtime resource contention, or to disentangle configurations. - encapsulate all the state of a server so it can be moved. I think the first axis is quite non-HPC, since I don't think of HPC jobs as being like idle services.
(OTOH, many clusters have good utilization because multiple workloads get interleaved _above_ the processor level.) the second factor is not often an HPC problem, at least not in my experience, where J Random Fortran user doesn't really care that much about the environment (ie - they want f77 and lapack and empty queues). migration has some HPC appeal, since it permits defragmenting a cluster, as well as better preemption. > Gavin, not necessarily. You could have a cluster of HPC compute nodes > running a minimal base OS. > Then install specific virtual machines with different OS/software stacks > each time you run a job. or for each job, just install the provided OS image on the bare metal... your job's done, have it halt or reboot the node ;) > OK, this is probably more relevant for grid or cloud computing - I first grid and cloud computing are all part of the same game, no? along with massively parallel low-latency MPI, old-style vector supercomputing, GPU-assisted computing, throughput serial farming, etc. > thought this would be a good idea when seeing > that (at the time) the CERN LHC Grid software would only run with Redhat > 7.2 > So you could imagine 'packaging up' a virtual machine which has your > particular OS flavour/libraries/compilers and shipping > it out with the job. right, that's one of the axes of the problem-space: whether the app gets its own custom runtime environment (in the sense of kernel, libc, etc). another axis is the degree to which the app has to contend for resources (as in an overcommitted normal cluster, or a VM without guaranteed resources.) > Another reason could be fault tolerance - you run VMs on the compute > nodes. When you detect that a hardware fault is coming along > (eg from ECC errors or disk errors) you perform a live migration from > one node to another - and your job keeps on trucking. > (In theory; checkpointing needed etc. etc.) I'm pretty skeptical about this - the main issue with checkpointing is when there are external side-effects. checkpointing networked apps (including MPI) is hard because you have state "in flight", so can only freeze-dry the state by quiescing (letting the messages land, etc). the "live migration" demos I've seen have been apps that are tolerant to the loss of in-flight transactions (or which retry automatically). so I don't think virt is any kind of paradigm-changer, just like manycore merely stretches existing definitions. -mark From dag at sonsorol.org Tue Jan 26 16:02:57 2010 From: dag at sonsorol.org (Chris Dagdigian) Date: Tue, 26 Jan 2010 19:02:57 -0500 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: References: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org> <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com> <4B5F18BB.8000406@sas.upenn.edu> <68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com> Message-ID: <4B5F82B1.10805@sonsorol.org> One of the virtualization trends I do see in HPC/clustering is in the area of packaging up entire scientific applications into their own custom VMs which contain all the necessary libraries, software dependencies etc. There is a performance hit now and implementation is clunky, but I can see cases where "each app sits in its own VM" and the VMs get launched across a cluster would be helpful. This sort of work is trending upwards with Amazon AWS and other infrastructure providers - it can be easier to blast your workflow out into 'the cloud' if it's all wrapped up in a self contained and super portable VM.
Given how many different versions of R and other core tools like Perl etc. that I need to support on heterogeneous scientific clusters, this could be a good trend, heh. Just my $.02 -Chris From eagles051387 at gmail.com Tue Jan 26 23:16:48 2010 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Wed, 27 Jan 2010 08:16:48 +0100 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: <4B5F82B1.10805@sonsorol.org> References: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org> <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com> <4B5F18BB.8000406@sas.upenn.edu> <68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com> <4B5F82B1.10805@sonsorol.org> Message-ID: chris, not only is the vm portable - yes, you would take a hit - but from my research into xen it seems like the paid version of citrix xen server has some other nice features, such as migration to a backup machine in case of hardware failure. when you all say performance hit, how much of a hit are we talking about? also, if you guys are running a number of complex computations on bare metal, aren't you sharing resources that way as well? On Wed, Jan 27, 2010 at 1:02 AM, Chris Dagdigian wrote: > > One of the virtualization trends I do see in HPC/clustering is in the area > of packaging up entire scientific applications into their own custom VMs > which contain all the necessary libraries, software dependencies etc. > > There is a performance hit now and implementation is clunky, but I can see > cases where "each app sits in its own VM" and the VMs get launched across a > cluster would be helpful. > > This sort of work is trending upwards with Amazon AWS and other > infrastructure providers - it can be easier to blast your workflow out into > 'the cloud' if it's all wrapped up in a self contained and super portable > VM. > > Given how many different versions of R and other core tools like Perl etc. > that I need to support on heterogeneous scientific clusters this could be a > good trend, heh. > > Just my $.02 > > -Chris > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Jonathan Aquilina From carsten.aulbert at aei.mpg.de Tue Jan 26 23:33:24 2010 From: carsten.aulbert at aei.mpg.de (Carsten Aulbert) Date: Wed, 27 Jan 2010 08:33:24 +0100 Subject: [Beowulf] Logging MCE information on next warm boot? In-Reply-To: References: Message-ID: <201001270833.25086.carsten.aulbert@aei.mpg.de> Hi David, On Tuesday 26 January 2010 19:46:40 David Mathog wrote: > The default log level on these machines is 3. If the kernel panics with > it set to that, will the messages that result be "contentless", like the > ones above? Try dmesg -n 8 to raise the logging level and try echo '<7>David Test' > /dev/kmsg That should produce output: Jan 27 08:32:24 10.10.12.43 [3098843.050122] David Test The 7 is the logging "severity"; try that with <0> and you will send a message to everyone on the system. Does this help? cheers Carsten -- Dr.
Carsten Aulbert - Max Planck Institute for Gravitational Physics Callinstrasse 38, 30167 Hannover, Germany Phone/Fax: +49 511 762-17185 / -17193 http://www.top500.org/system/9234 | http://www.top500.org/connfam/6/list/3 CaCert Assurer | Get free certificates from http://www.cacert.org/ From henning.fehrmann at aei.mpg.de Wed Jan 27 00:29:58 2010 From: henning.fehrmann at aei.mpg.de (Henning Fehrmann) Date: Wed, 27 Jan 2010 09:29:58 +0100 Subject: [Beowulf] Logging MCE information on next warm boot? In-Reply-To: References: Message-ID: <20100127082958.GA3644@gretchen.aei.mpg.de> Hi David, On Tue, Jan 26, 2010 at 10:46:40AM -0800, David Mathog wrote: > > David Mathog wrote: > > Will a logger > > message for "kern" test it, or is there some other way to force a > > printk? I'm afraid the logger method might look like it is working, but > > just go through the usual syslog channels instead of netconsole. > > Too optimistic. With netconsole (supposedly) running on the local node Correct. > > logger -p kern.err "test from me" > > only shows up in the log file on that node. No chance of confusion ;-). > There is no explicit network logging of kern.err in /etc/syslog.conf, > since I figured syslog is never going to be able to actually log > anything after a kernel error. > > dmesg shows that netconsole started and thinks it is working: > > netconsole: local port 6666 > netconsole: local IP 192.168.1.20 > netconsole: interface eth0 > netconsole: remote port 514 > netconsole: remote IP 192.168.1.220 > netconsole: remote ethernet address 00:30:48:59:f8:ff > console [netcon0] enabled > netconsole: network logging started > > However, absolutely nothing comes over netconsole when a node reboots. > > Searched a lot and finally found out how to test netconsole: > > [root at monkey20 rc6.d]# echo 'p' > /proc/sysrq-trigger > [root at monkey20 rc6.d]# echo 't' > /proc/sysrq-trigger > [root at monkey20 rc6.d]# echo 'm' > /proc/sysrq-trigger > > and it generated these on the syslogd machine > > Jan 26 10:21:12 monkey20.cluster SysRq : > Jan 26 10:21:12 monkey20.cluster Show Regs > Jan 26 10:21:35 monkey20.cluster SysRq : > Jan 26 10:21:35 monkey20.cluster Show State > Jan 26 10:21:52 monkey20.cluster SysRq : > Jan 26 10:21:52 monkey20.cluster Show Memory > > Notice the contentless messages, which were the same as on the video > console. This is a log level issue, change it with dmesg or > > [root at monkey20 rc6.d]# echo '9' > /proc/sysrq-trigger > [root at monkey20 rc6.d]# echo 'm' > /proc/sysrq-trigger > > and then a pile of memory information shows up on both the syslog side > and the video console. > > The default log level on these machines is 3. If the kernel panics with > it set to that, will the messages that result be "contentless", like the > ones above? Hmmm, we had no kernel panics since we set up netconsole. I also don't know how much a NIC is affected by a panic. I tried to find something in the kernel source. At least the panic message has the log level KERN_EMERG so something should go through. I guess it is a matter of experience. I'd start with log level 7 which can be reduced any time. 
Cheers,
Henning

From john.hearns at mclaren.com Wed Jan 27 01:24:13 2010
From: john.hearns at mclaren.com (Hearns, John)
Date: Wed, 27 Jan 2010 09:24:13 -0000
Subject: [Beowulf] clustering using xen virtualized machines
In-Reply-To:
References: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org> <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com><4B5F18BB.8000406@sas.upenn.edu><68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com> <4B5F82B1.10805@sonsorol.org>
Message-ID: <68A57CCFD4005646957BD2D18E60667B0F12AF1B@milexchmb1.mil.tagmclarengroup.com>

HPCwire have a feature on HPC Cloud computing:
http://www.hpcwire.com/specialfeatures/cloud_computing/

The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy.

From chris at csamuel.org Mon Jan 25 16:48:50 2010
From: chris at csamuel.org (Chris Samuel)
Date: Tue, 26 Jan 2010 11:48:50 +1100
Subject: [Beowulf] Logging MCE information on next warm boot?
In-Reply-To:
References:
Message-ID: <201001261148.56168.chris@csamuel.org>

Hi David,

Apologies for the personal copy, but emails to the list from my new address are being moderated and I suspect the moderator is away at present.

On Tue, 26 Jan 2010 05:46:31 am David Mathog wrote:
> Is it possible to have the Machine Check Exception (MCE) information saved to disk automatically on the next warm boot?

Depending on your kernel version it may well do that by default; for instance both 2.6.20 and 2.6.28 (to pick at random from git) say:

/* Log the machine checks left over from the previous reset. This also clears all registers */
do_machine_check(NULL, mce_bootlog ? -1 : -2);

Greg mentions mcelog; well, that will write output to a file, but if that data doesn't make it to spinning rust before the machine locks up then you're out of luck, as it'll have cleared the MCE log as part of its action. :-(

There is parsemce by Dave Jones [1]; apparently you can run some of the parameters you get through it - for instance, for your error I get:

$ ./parsemce -e 0000000000000007 -b 2 -a 00000000001511C0 -s 940040000000017A
Status: (7) Machine Check in progress. Error IP valid Restart IP valid.
parsebank(2): 940040000000017a @ 1511c0
External tag parity error
Correctable ECC error
Address in addr register valid
Error enabled in control register
Memory heirarchy error
Request: Generic error
Transaction type : Generic
Memory/IO : I/O

IIRC that means that you took a machine check whilst there was already an MCE happening, and that becomes an uncorrectable error and the box will die.

[1] - http://www.codemonkey.org.uk/projects/parsemce/parsemce.c

If you can upgrade to a current kernel (2.6.3x) you can enable the new EDAC code, which will decode MCEs in the kernel and process/log them there. That might yield better information for you (and might even make it to a remote syslog if they don't make it to the local platters).

Best of luck!
Chris
-- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic. For more info see: http://en.wikipedia.org/wiki/OpenPGP
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 481 bytes
Desc: This is a digitally signed message part.
URL:

From pc7 at sanger.ac.uk Tue Jan 26 07:24:25 2010
From: pc7 at sanger.ac.uk (Peter Clapham)
Date: Tue, 26 Jan 2010 15:24:25 +0000
Subject: [Beowulf] clustering using xen virtualized machines
In-Reply-To:
References:
Message-ID: <4B5F0929.7080800@sanger.ac.uk>

On the AWS ec2 side, we've been performing a range of tests including full genome sequencing pipelines across varying numbers of nodes and storage. The biggest challenge to date has been IO, particularly if the smaller image systems are used. Where jobs are highly CPU bound and only lightly network (or, heaven forbid, disk) bound, things go reasonably well and have the potential to scale. Once IO becomes a factor the scaling decreases rapidly...

We've also had a run around with Xen. It requires more network fiddling to automate rollouts (at least in our environment) but it works OK, especially when paired with something like openQRM. It's a ways off being as polished as VMware, and some of the interesting memory handling doesn't appear to be all there. As a result, performance degrades fairly severely as the number of hosts and the load from IO-hungry apps increase. Regrettably I don't have enough useful data to present at present, and as always YMMV.

Pete

> I've been using Amazon ec2 for clustering for months now, from a software perspective it's very similar to running real hardware. For my needs (development) it's perfectly adequate, I've not benchmarked it against running the same code on the raw hardware though.
> Ashley,

-- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

From carlosf at cesga.es Tue Jan 26 08:25:31 2010
From: carlosf at cesga.es (Carlos Fernandez Sanchez)
Date: Tue, 26 Jan 2010 17:25:31 +0100
Subject: [Beowulf] clustering using xen virtualized machines
In-Reply-To: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org>
References: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org>
Message-ID: <040618AB773740DDA96583DAD8ED476B@pccarlosf2>

Another reference that might be worth looking at: Executing SGE Clusters on top of Hybrid Clouds using OpenNebula:
http://www.opennebula.org/lib/exe/fetch.php?id=outreach&cache=cache&media=constantino_vazquez_-_opennebula_-_executing_sge_clusters_on_top_of_hybrid_clouds_using_opennebula.ppt

Regards,
Carlos Fernandez Sanchez
Systems Manager CESGA
Avda. de Vigo s/n. Campus Sur
Tel.: (+34) 981569810, ext. 232
15705 - Santiago de Compostela SPAIN

--------------------------------------------------
From: "Douglas Eadline"
Sent: Tuesday, January 26, 2010 12:23 AM
To: "Jonathan Aquilina"
Cc: "Beowulf Mailing List"
Subject: Re: [Beowulf] clustering using xen virtualized machines

> You may want to look at this:
> Building A Virtual Cluster with Xen
> http://www.clustermonkey.net//content/view/139/33/
> -- Doug
>> has anyone tried clustering using xen based vm's. what is everyone's take on that? it's something that popped into my head while in my lectures today.
>> --
>> Jonathan Aquilina
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
> -- Doug
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From bug at sas.upenn.edu Tue Jan 26 08:30:51 2010
From: bug at sas.upenn.edu (Gavin Burris)
Date: Tue, 26 Jan 2010 11:30:51 -0500
Subject: [Beowulf] clustering using xen virtualized machines
In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com>
References: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org> <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com>
Message-ID: <4B5F18BB.8000406@sas.upenn.edu>

Is it just me, or do HPC clustering and virtualization fall on opposite ends of the spectrum?

With virtualization, you are pooling many virtual OS/server instances on high availability hardware, sharing memory and CPU as demanded, oversubscribing. What would be idle time on one server is utilized by another, loaded server.

With HPC clustering, you are running many physical OS/server instances that usually do not need to be highly available, but instead need to have direct access and total utilization of memory, CPU and storage. If queuing is done well, all servers are maxed out for performance under load.

With xen/vmware/amazon clusters, it seems that you would be adding the complexity and cost of a virtualization infrastructure, with few of the benefits that virtualization is targeted to provide.

Cheers.

On 01/26/2010 10:24 AM, Hearns, John wrote:
> for starters, to save on resources why not cut out the GUI and go command line to free up some more of the shared resources, and secondly wouldn't offloading data storage to a SAN or NFS storage server mitigate the disk I/O issues?
> I honestly don't know much about Xen as I just got my hands dirty with it. wouldn't it be better than using software virtualization, since Xen takes advantage of the hardware virtualization that most modern processors come with?
> Jonathan, in a private reply I've already said that you should not be put off from having bright ideas!
> In no way wishing to rain on your parade - and indeed wishing you to experiment and keep asking questions, which you are very welcome to do, this has been thought of.
> Cluster nodes are commonly run without a GUI - command line only, as you say. The debate comes around on this list every so often about running diskless! The answer is yes, you can run diskless compute nodes, and I do. You boot them over the network, and have an NFS-root filesystem. On many clusters the application software is NFS mounted also.
> Your point about a SAN is very relevant - I would say that direct, physical fibrechannel SAN connections in a cluster are not common - simply due to the expense of installing the cards and a separate infrastructure. However, iSCSI is used and Infiniband is common in clusters.
> Apologies - I really don't want to come across as knowing better than you (which I don't). If we don't have people asking "what if" and "hey - here's a good idea" then you won't make anything new.
> > > The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy.
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From Z.Wu at leeds.ac.uk Mon Jan 25 08:54:08 2010
From: Z.Wu at leeds.ac.uk (Zhili Wu)
Date: Mon, 25 Jan 2010 16:54:08 +0000
Subject: [Beowulf] CFP -- (SMLA 2010) Scalable Machine Learning and Applications Workshop
Message-ID: <15AB31C404448F4A918D077CA7E74C6E012B1989ADEE@HERMES7.ds.leeds.ac.uk>

[Please accept our apologies if you receive multiple copies of this email]

CALL FOR PAPERS

The 2010 International Workshop on Scalable Machine Learning and Applications (SMLA-10)
To be held in conjunction with CIT'10 (Supported by IEEE Computer Society), June 29 - July 1, 2010, Bradford, UK
http://smlc09.leeds.ac.uk/smla/
http://www.scim.brad.ac.uk/~ylwu/CIT2010/

SCOPE: Machine learning and data mining have been playing an increasing role in many real scenarios, such as web mining, language processing, image search and financial engineering. In these application domains, data are surpassing the terabyte scale at an ever-faster pace, but the techniques for processing and mining them often lag behind in far too many aspects. To deal with billions of web pages, images and transaction records, and with capacity-intensive audio and video data streams, machine learning and data mining techniques and their underlying computing infrastructure are facing great challenges. In this SMLA workshop we aim to bring together researchers and practitioners to advance scalable machine learning and its applications. On the one hand, we expect works on how to dramatically empower existing machine learning and data mining methods via grid/cloud or other novel computing models. On the other hand, we value the effort of building or extending machine learning and data mining methods that are scalable to huge datasets.

Papers can be related to any subset of the following topics, or any unconventional direction to scale up machine learning and data mining methods:
-- Cloud Computing
-- Large Scale Data Mining
-- Fast Support Vector Machines
-- Data Abstraction, Dimension Reduction
-- User Personalization and Recommendation
-- Natural Language Processing
-- Ontology and Semantic Technologies
-- Parallelization of Machine Learning Methods
-- Fast Machine Learning Model Tuning and Selection
-- Large Scale Webpage Topic, Genre, Sentiment Classification
-- Financial Engineering

STEERING COMMITTEE
Chih-Jen Lin, National Taiwan University, Taiwan
Serge Sharoff, University of Leeds, UK
Katja Markert, University of Leeds, UK
Ivor Wai-Hung Tsang, Nanyang Technological University, Singapore

PROGRAM CHAIRS
Zhili Wu, University of Leeds, UK
Xiaolong Jin, University of Bradford, UK

PUBLICITY CHAIRS
Evi Syukur, University of New South Wales, Australia
Lei Liu, University of Bradford, UK

PROGRAM COMMITTEE
Please refer to http://smlc09.leeds.ac.uk/smla/committee.htm for a complete list of the program committee.

PAPER SUBMISSION: Authors are invited to submit manuscripts reporting original unpublished research and recent developments in the topics related to the workshop.
The length of the papers should not exceed 6 pages + 2 pages for over-length charges (IEEE Computer Society Proceedings Manuscripts style: two columns, single-spaced), including figures and references, using a 10pt font, and number each page. Papers should be submitted electronically in PDF format (or PostScript) by sending them as an e-mail attachment to Zhili Wu (z.wu at leeds.ac.uk). All papers will be peer reviewed and the comments will be provided to the authors. The accepted papers will be published together with those of other CIT'10 workshops by the IEEE Computer Society Press.

***********************************************************************
Distinguished selected papers, after further extensions, will be published in CIT 2010's special issues of the following prestigious SCI-indexed journals:
-- The Journal of Supercomputing - Springer
-- Journal of Computer and System Sciences - Elsevier
-- Concurrency and Computation: Practice and Experience - John Wiley & Sons
***********************************************************************

IMPORTANT DATES:
Paper submission: February 15, 2010
Notification of Acceptance: April 01, 2010
Camera-ready due: April 18, 2010
Author registration: April 18, 2010
Conference: June 29 - July 1, 2010
***********************************************************************

From geoff at galitz.org Wed Jan 27 02:42:48 2010
From: geoff at galitz.org (Geoff Galitz)
Date: Wed, 27 Jan 2010 11:42:48 +0100
Subject: [Beowulf] clustering using xen virtualized machines
In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0F12AF1B@milexchmb1.mil.tagmclarengroup.com>
References: <38160.192.168.1.1.1264461839.squirrel@mail.eadline.org><68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com><4B5F18BB.8000406@sas.upenn.edu><68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com><4B5F82B1.10805@sonsorol.org> <68A57CCFD4005646957BD2D18E60667B0F12AF1B@milexchmb1.mil.tagmclarengroup.com>
Message-ID:

I've had the good fortune to be in the HPC and also the HA business for a few years (10 years for HPC but only about 4 for HA). Given the current approach to virtualization, I don't see that Xen or other virtualization technologies are good for HPC environments if performance is a paramount concern.

Virtualization in an HPC/HA world is mostly beneficial for portability and fail-over. But the added layer for a hypervisor will be significant if your jobs run for an extended period of time. I've seen jobs that run for months... a 7% performance penalty (fairly typical in my experience) over the course of a month is significant.

---------------------------------
Geoff Galitz
Blankenheim NRW, Germany
http://www.galitz.org/
http://german-way.com/blog/

From eagles051387 at gmail.com Wed Jan 27 04:08:25 2010
From: eagles051387 at gmail.com (Jonathan Aquilina)
Date: Wed, 27 Jan 2010 13:08:25 +0100
Subject: [Beowulf] clustering using xen virtualized machines
In-Reply-To:
References: <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com> <4B5F18BB.8000406@sas.upenn.edu> <68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com> <4B5F82B1.10805@sonsorol.org> <68A57CCFD4005646957BD2D18E60667B0F12AF1B@milexchmb1.mil.tagmclarengroup.com>
Message-ID:

Gavin, you mentioned costs; those are only incurred with Xen if you need the extra features, such as server migration. Also, if you don't need those extra features, couldn't you just live with the free version of Xen?
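For scale, taking Geoff's ~7% figure at face value: a job that runs for a 30-day month on bare metal, roughly 720 hours, would give up about 0.07 x 720 = ~50 hours of compute to the hypervisor.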
On Wed, Jan 27, 2010 at 11:42 AM, Geoff Galitz wrote: > > > I've had the good fortune to be in the HPC and also HA business for a few > years (10 years for HPC but only about 4 for HA). Given the current > approach for virtualization I don't see that Xen or other virtualization > technologies are good for HPC environments if the performance is a > paramount > concern. > > Virtualization in an HPC/HA world is mostly beneficial for portability and > fail-over. But the added layer for a hypervisor will be significant if > your > jobs run for an extended period of time. I've seen jobs that run for > months... a 7% performance penalty (fairly typical in my experience) over > the course of a month is significant. > > > > --------------------------------- > Geoff Galitz > Blankenheim NRW, Germany > http://www.galitz.org/ > http://german-way.com/blog/ > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Jonathan Aquilina -------------- next part -------------- An HTML attachment was scrubbed... URL: From bug at sas.upenn.edu Wed Jan 27 07:18:49 2010 From: bug at sas.upenn.edu (Gavin Burris) Date: Wed, 27 Jan 2010 10:18:49 -0500 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com> <4B5F18BB.8000406@sas.upenn.edu> <68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com> <4B5F82B1.10805@sonsorol.org> <68A57CCFD4005646957BD2D18E60667B0F12AF1B@milexchmb1.mil.tagmclarengroup.com> Message-ID: <4B605959.2000607@sas.upenn.edu> The cost for virtualization is in buying really big hardware, oodles of memory and many many cores, that are capable of running multiple VMs, and having that hardware configured for redundancy, high availability and failover. With an HPC cluster, you are typically buying hardware that is as stripped down and cheap as you can get it. You focus your HPC budget on the sweet-spot processor, the amount of memory, maybe GPUs, maybe interconnect, so you can deploy as many compute server nodes as you can afford. I don't buy the argument that the winning case is packaging up a VM with all your software. If you really are unable to build the required software stack for a given cluster and its OS, I think using something like xCAT to provision stateless compute servers per job is a better option than virtualization. And if you are packaging VMs to blast out to the cloud, I think you will be paying through the nose. This is not a viable option unless there is a major pricing shift. Cheers. On 01/27/2010 07:08 AM, Jonathan Aquilina wrote: > gavin you mentioned costs, those are only incurred with xen if you need > the extra features such as server migration and other features. also if > you dont need those extra features couldnt you just live with the free > version of xen. > > On Wed, Jan 27, 2010 at 11:42 AM, Geoff Galitz > wrote: > > > > I've had the good fortune to be in the HPC and also HA business for > a few > years (10 years for HPC but only about 4 for HA). Given the current > approach for virtualization I don't see that Xen or other virtualization > technologies are good for HPC environments if the performance is a > paramount > concern. > > Virtualization in an HPC/HA world is mostly beneficial for > portability and > fail-over. 
But the added layer for a hypervisor will be significant > if your > jobs run for an extended period of time. I've seen jobs that run for > months... a 7% performance penalty (fairly typical in my > experience) over > the course of a month is significant. > > > > --------------------------------- > Geoff Galitz > Blankenheim NRW, Germany > http://www.galitz.org/ > http://german-way.com/blog/ > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > > > > -- > Jonathan Aquilina > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mathog at caltech.edu Wed Jan 27 08:51:09 2010 From: mathog at caltech.edu (David Mathog) Date: Wed, 27 Jan 2010 08:51:09 -0800 Subject: [Beowulf] Re: Logging MCE information on next warm boot? Message-ID: Carsten Aulbert wrote: > > echo '<7>David Test' > /dev/kmsg > > That should produce output: > > Jan 27 08:32:24 10.10.12.43 [3098843.050122] David Test > > The 7 is the logging "severity" That was a good tip. Using the default dmesg setting and entering <0> -> <3> into the message string it logged across the network, <4> and up it did not. So netconsole seems to be working as expected. The mapping of numbers to types (error, emerg, etc.) seems not to be in the man files, or at least I have not found it there, but is in: /usr/include/sys/syslog.h and is #define LOG_EMERG 0 /* system is unusable */ #define LOG_ALERT 1 /* action must be taken immediately */ #define LOG_CRIT 2 /* critical conditions */ #define LOG_ERR 3 /* error conditions */ #define LOG_WARNING 4 /* warning conditions */ #define LOG_NOTICE 5 /* normal but significant condition */ #define LOG_INFO 6 /* informational */ #define LOG_DEBUG 7 /* debug-level messages */ thanks, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From eagles051387 at gmail.com Wed Jan 27 09:07:48 2010 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Wed, 27 Jan 2010 18:07:48 +0100 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: <4B605959.2000607@sas.upenn.edu> References: <4B5F18BB.8000406@sas.upenn.edu> <68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com> <4B5F82B1.10805@sonsorol.org> <68A57CCFD4005646957BD2D18E60667B0F12AF1B@milexchmb1.mil.tagmclarengroup.com> <4B605959.2000607@sas.upenn.edu> Message-ID: thanks for all yoru responses. i admit i dont have the money at the moment or a job to get my hands dirty with hpc. im planning in the future to setup a rendering cluster. i appreciate all the feed back here. im just wondering now would for instance a head node be of any use running virtualized guest os's or does the head node need to not share the hardware with other os's On Wed, Jan 27, 2010 at 4:18 PM, Gavin Burris wrote: > The cost for virtualization is in buying really big hardware, oodles of > memory and many many cores, that are capable of running multiple VMs, > and having that hardware configured for redundancy, high availability > and failover. > > With an HPC cluster, you are typically buying hardware that is as > stripped down and cheap as you can get it. 
You focus your HPC budget on > the sweet-spot processor, the amount of memory, maybe GPUs, maybe > interconnect, so you can deploy as many compute server nodes as you can > afford. > > I don't buy the argument that the winning case is packaging up a VM with > all your software. If you really are unable to build the required > software stack for a given cluster and its OS, I think using something > like xCAT to provision stateless compute servers per job is a better > option than virtualization. > > And if you are packaging VMs to blast out to the cloud, I think you will > be paying through the nose. This is not a viable option unless there is > a major pricing shift. > > Cheers. > > > On 01/27/2010 07:08 AM, Jonathan Aquilina wrote: > > gavin you mentioned costs, those are only incurred with xen if you need > > the extra features such as server migration and other features. also if > > you dont need those extra features couldnt you just live with the free > > version of xen. > > > > On Wed, Jan 27, 2010 at 11:42 AM, Geoff Galitz > > wrote: > > > > > > > > I've had the good fortune to be in the HPC and also HA business for > > a few > > years (10 years for HPC but only about 4 for HA). Given the current > > approach for virtualization I don't see that Xen or other > virtualization > > technologies are good for HPC environments if the performance is a > > paramount > > concern. > > > > Virtualization in an HPC/HA world is mostly beneficial for > > portability and > > fail-over. But the added layer for a hypervisor will be significant > > if your > > jobs run for an extended period of time. I've seen jobs that run for > > months... a 7% performance penalty (fairly typical in my > > experience) over > > the course of a month is significant. > > > > > > > > --------------------------------- > > Geoff Galitz > > Blankenheim NRW, Germany > > http://www.galitz.org/ > > http://german-way.com/blog/ > > > > > > > > _______________________________________________ > > Beowulf mailing list, Beowulf at beowulf.org > > sponsored by Penguin Computing > > To change your subscription (digest mode or unsubscribe) visit > > http://www.beowulf.org/mailman/listinfo/beowulf > > > > > > > > > > -- > > Jonathan Aquilina > > > > > > > > _______________________________________________ > > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Jonathan Aquilina -------------- next part -------------- An HTML attachment was scrubbed... URL: From jlforrest at berkeley.edu Wed Jan 27 09:31:49 2010 From: jlforrest at berkeley.edu (Jon Forrest) Date: Wed, 27 Jan 2010 09:31:49 -0800 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: References: <4B5F18BB.8000406@sas.upenn.edu> <68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com> <4B5F82B1.10805@sonsorol.org> <68A57CCFD4005646957BD2D18E60667B0F12AF1B@milexchmb1.mil.tagmclarengroup.com> <4B605959.2000607@sas.upenn.edu> Message-ID: <4B607885.1070709@berkeley.edu> At a recent Rocks clustering user's group meeting the recent addition of Rocks support of Xen-based virtual clusters came up. Some of the same questions recently raised on this list were discussed there. One justification for virtual clusters that I hadn't thought of was discussed. This only applies in places with large clusters run by a central computing group but used by various internal customers. 
Using virtual clusters makes it very easy to supply clusters to customers who need a cluster for a limited period of time. The amount of effort necessary to provision a new cluster is minimal. Nodes can easily and quickly be added, if necessary. This is as opposed to buying a new cluster for a research group, using it for a couple of months, and then turning it off. So, in this case, virtualized clusters have the advantage of being easier to manage. The performance overhead caused by the virtualization is a factor, but it's decreasing as time goes on due to better hardware support of virtualization and cleverer software. Cordially, -- Jon Forrest Research Computing Support College of Chemistry 173 Tan Hall University of California Berkeley Berkeley, CA 94720-1460 510-643-1032 jlforrest at berkeley.edu From eagles051387 at gmail.com Wed Jan 27 10:30:14 2010 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Wed, 27 Jan 2010 19:30:14 +0100 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: <4B607885.1070709@berkeley.edu> References: <4B5F82B1.10805@sonsorol.org> <68A57CCFD4005646957BD2D18E60667B0F12AF1B@milexchmb1.mil.tagmclarengroup.com> <4B605959.2000607@sas.upenn.edu> <4B607885.1070709@berkeley.edu> Message-ID: so basically what your saying is something along the lines of a rendering cluster would be a good candidate for this? On Wed, Jan 27, 2010 at 6:31 PM, Jon Forrest wrote: > At a recent Rocks clustering user's group > meeting the recent addition of Rocks support of > Xen-based virtual clusters came up. Some > of the same questions recently raised on this > list were discussed there. > > One justification for virtual clusters that I > hadn't thought of was discussed. This only applies > in places with large clusters run by a central > computing group but used by various internal > customers. Using virtual clusters makes it > very easy to supply clusters to customers > who need a cluster for a limited period of > time. The amount of effort necessary to > provision a new cluster is minimal. > Nodes can easily and quickly be added, > if necessary. This is as opposed to buying > a new cluster for a research group, using it > for a couple of months, and then turning it > off. > > So, in this case, virtualized clusters have > the advantage of being easier to manage. The > performance overhead caused by the virtualization > is a factor, but it's decreasing as time goes > on due to better hardware support of virtualization > and cleverer software. > > Cordially, > -- > Jon Forrest > Research Computing Support > College of Chemistry > 173 Tan Hall > University of California Berkeley > Berkeley, CA > 94720-1460 > 510-643-1032 > jlforrest at berkeley.edu > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Jonathan Aquilina -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From jlforrest at berkeley.edu Wed Jan 27 10:35:39 2010
From: jlforrest at berkeley.edu (Jon Forrest)
Date: Wed, 27 Jan 2010 10:35:39 -0800
Subject: [Beowulf] clustering using xen virtualized machines
In-Reply-To:
References: <4B5F82B1.10805@sonsorol.org> <68A57CCFD4005646957BD2D18E60667B0F12AF1B@milexchmb1.mil.tagmclarengroup.com> <4B605959.2000607@sas.upenn.edu> <4B607885.1070709@berkeley.edu>
Message-ID: <4B60877B.6060803@berkeley.edu>

On 1/27/2010 10:30 AM, Jonathan Aquilina wrote:
> so basically what you're saying is something along the lines of a rendering cluster would be a good candidate for this?

I'm saying nothing about how a virtualized cluster could or should be used. I'm only commenting about how a virtualized cluster might be easier to deal with from a central management point of view.

-- Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA 94720-1460
510-643-1032
jlforrest at berkeley.edu

From hahn at mcmaster.ca Thu Jan 28 07:10:20 2010
From: hahn at mcmaster.ca (Mark Hahn)
Date: Thu, 28 Jan 2010 10:10:20 -0500 (EST)
Subject: [Beowulf] clustering using xen virtualized machines
In-Reply-To: <4B605959.2000607@sas.upenn.edu>
References: <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com> <4B5F18BB.8000406@sas.upenn.edu> <68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com> <4B5F82B1.10805@sonsorol.org> <68A57CCFD4005646957BD2D18E60667B0F12AF1B@milexchmb1.mil.tagmclarengroup.com> <4B605959.2000607@sas.upenn.edu>
Message-ID:

> I don't buy the argument that the winning case is packaging up a VM with all your software. If you really are unable to build the required software stack for a given cluster and its OS, I think using something

you're right, but only for narrow-function clusters. suppose you have a cluster used by 2k users across a handful of different universities and 100 departments. and have, let's say, 2 staff. it's conceivable that using VMs would permit a higher level of service by putting more configuration flexibility into the hands of the users. yes, most would use a standard image (which might be the bare-metal one, actually), but making it easier to accommodate variance is valuable.

it even offers the ability to shift the model - instead of actually booting VMs on nodes for a job, how about just resurrecting a number of VM instances (freeze-dried in already-booted state)? that makes the setup latency potentially much lower. (pages from a VM image can be fetched lazily afaik, and presumably also COW.)

for the few HPC-oriented performance studies of VMs I've seen, the only slowdowns were for OS activity (IO, page allocation, etc). an ideally-behaved HPC app minimizes those already, so...

From hahn at mcmaster.ca Thu Jan 28 07:17:25 2010
From: hahn at mcmaster.ca (Mark Hahn)
Date: Thu, 28 Jan 2010 10:17:25 -0500 (EST)
Subject: [Beowulf] clustering using xen virtualized machines
In-Reply-To:
References: <4B5F18BB.8000406@sas.upenn.edu> <68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com> <4B5F82B1.10805@sonsorol.org> <68A57CCFD4005646957BD2D18E60667B0F12AF1B@milexchmb1.mil.tagmclarengroup.com>
Message-ID:

> im just wondering now would for instance a head node be of any use running virtualized guest os's or does the head node need to not share the hardware with other os's

well, the HA-ish motive for VMs has some application to the admin portions of even a pure HPC cluster.
for instance, your jobs may execute on bare metal, but there is some appeal to putting various cluster support services into their own VMs. for instance, most clusters need DHCP and TFTP (eg for PXE) - but that's a fairly lightweight service that could be VM'ed. you'd lose a bit of performance, but gain the ability to switch physical hosts, can still share physical hosts with other services, and insulate the service from random insult like OS/library upgrades. you can always roll back to a known-good config. this is not a huge breakthrough, since such services are not all that fragile in the first place. in a sense, part of the value-add of using a VM is encapsulating a bunch of system settings in a way that's otherwise spread across multiple files. regards, mark hahn. From tjrc at sanger.ac.uk Thu Jan 28 08:14:23 2010 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Thu, 28 Jan 2010 16:14:23 +0000 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com> <4B5F18BB.8000406@sas.upenn.edu> <68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com> <4B5F82B1.10805@sonsorol.org> <68A57CCFD4005646957BD2D18E60667B0F12AF1B@milexchmb1.mil.tagmclarengroup.com> <4B605959.2000607@sas.upenn.edu> Message-ID: On 28 Jan 2010, at 3:10 pm, Mark Hahn wrote: >> I don't buy the argument that the winning case is packaging up a VM >> with >> all your software. If you really are unable to build the required >> software stack for a given cluster and its OS, I think using >> something > > you're right, but only for narrow-function clusters. suppose you > have a cluster used by 2k users across a handful of different > universities > and 100 departments. and have, let's say, 2 staff. it's conceivable > that using VMs would permit a higher level of service by putting > more configuration flexibility into the hands of the users. yes, > most would > use a standard image (which might be the bare-metal one, actually), > but making it easier to accommodate variance is valuable. > > it even offers the ability to shift the model - instead of actually > booting VMs on nodes for a job, how about just resurrecting a number > of VM instances (freeze-dried in already-booted state)? that makes > the setup latency potentially much lower. (pages from a VM image can > be fetched lazily afaik, and presumably also COW.) COW is certainly how some of the virtual desktop solutions work; desktop machines are 90% identical in most organisations, so it makes sense to use COW when firing up a new one. So the technology is definitely around. > for the few HPC-oriented performance studies of VMs I've seen, > the only slowdowns were for OS activity (IO, page allocation, etc). > an ideally-behaved HPC app minimizes those already, so... We've certainly seen some interesting behaviour as far as the network is concerned. We tried creating a VM with a Lustre client in it, and have not had much success with that. There's more variability in network latency, and the Lustre servers hate that and keep ejecting the client. We haven't solved the problem yet. Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. 
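To make the service-VM idea a couple of messages up concrete, here is a minimal sketch of what such a guest could look like under Xen 3.x xm tooling; every name, path, volume and MAC address below is hypothetical, and a paravirtualized guest kernel plus an existing bridge are assumed:

# write a tiny domU config holding just the dhcpd/tftpd service
cat > /etc/xen/pxe-services.cfg <<'EOF'
name    = "pxe-services"
kernel  = "/boot/vmlinuz-2.6.18-xen"
ramdisk = "/boot/initrd-2.6.18-xen.img"
memory  = 256
vcpus   = 1
disk    = ['phy:/dev/vg0/pxe-services,xvda,w']
vif     = ['mac=00:16:3e:00:00:01, bridge=xenbr0']
root    = "/dev/xvda ro"
EOF

# boot it on whichever dom0 is convenient
xm create pxe-services.cfg

Because the whole service lives in one small image, it can be moved to another physical host or rolled back to a known-good state without touching the compute nodes it serves.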
From bug at sas.upenn.edu Thu Jan 28 08:23:30 2010
From: bug at sas.upenn.edu (Gavin Burris)
Date: Thu, 28 Jan 2010 11:23:30 -0500
Subject: [Beowulf] clustering using xen virtualized machines
In-Reply-To:
References: <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com> <4B5F18BB.8000406@sas.upenn.edu> <68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com> <4B5F82B1.10805@sonsorol.org> <68A57CCFD4005646957BD2D18E60667B0F12AF1B@milexchmb1.mil.tagmclarengroup.com> <4B605959.2000607@sas.upenn.edu>
Message-ID: <4B61BA02.9000303@sas.upenn.edu>

Two staff couldn't handle 2k users and 100 departments, or that much hardware. Answering tickets or emails alone would be overwhelming. Building/maintaining the VMs, or training/documenting/helping the departments to build their own VMs, is a monumental task in and of itself. A more realistic number is 1 FTE per 4 HPC-using departments.

I would wager that generalizing and not targeting any particular performance aspect will only cause the departments to pool their own money and build their own targeted resource, for less money, with a grad student and an O'Reilly book. I find that most users only have time for their application workflow or domain-specific coding, not to be system programmers making VMs.

Sorry, I'm not drinking the virtualization/cloud koolaid. I'd love to have everything abstracted and easy to manage, but I find standardizing on an OS or two and keeping things as stock as possible is easier, and cheaper to manage at this point. In my situation, virtualization just adds complexity and has a price/performance penalty.

Cheers.

On 01/28/2010 10:10 AM, Mark Hahn wrote:
>> I don't buy the argument that the winning case is packaging up a VM with all your software. If you really are unable to build the required software stack for a given cluster and its OS, I think using something
> you're right, but only for narrow-function clusters. suppose you have a cluster used by 2k users across a handful of different universities and 100 departments. and have, let's say, 2 staff. it's conceivable that using VMs would permit a higher level of service by putting more configuration flexibility into the hands of the users. yes, most would use a standard image (which might be the bare-metal one, actually), but making it easier to accommodate variance is valuable.
> it even offers the ability to shift the model - instead of actually booting VMs on nodes for a job, how about just resurrecting a number of VM instances (freeze-dried in already-booted state)? that makes the setup latency potentially much lower. (pages from a VM image can be fetched lazily afaik, and presumably also COW.)
> for the few HPC-oriented performance studies of VMs I've seen, the only slowdowns were for OS activity (IO, page allocation, etc). an ideally-behaved HPC app minimizes those already, so...
> From hahn at mcmaster.ca Thu Jan 28 08:40:52 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 28 Jan 2010 11:40:52 -0500 (EST) Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: <4B61BA02.9000303@sas.upenn.edu> References: <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com> <4B5F18BB.8000406@sas.upenn.edu> <68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com> <4B5F82B1.10805@sonsorol.org> <68A57CCFD4005646957BD2D18E60667B0F12AF1B@milexchmb1.mil.tagmclarengroup.com> <4B605959.2000607@sas.upenn.edu> <4B61BA02.9000303@sas.upenn.edu> Message-ID: > Two staff couldn't handle 2k users and 100 of departments, or that much > hardware. so you say. my example is a scaled down version of actual numbers of my organization. of course, much depends on how you define "user" (logged in right now or "getent passwd | wc -l"?) or for that matter how you define staff. > Answering tickets or emails alone would be overwhelming. > Building/maintaining the VMs, or training/document/helping the > departments to build their own VMs is a monumental task in and of > itself. A more realistic number is 1 FTE per 4 hpc-using departments. hah. good for you! > I would wager that generalizing and not targeting any particular > performance aspect will only cause the departments to pool their own > money and build their own targeted resource, for less money, with a grad > student and an oreilly book. there's always some tension between such approaches. but most HPC PIs quickly learn that it hurts their research when they lose grad student time to cluster admin. the time consumed by cluster admin has a large constant factor, and very weak size scaling. > I find that most users only have time for > their application workflow or domain-specific coding, not to be system > programmers making VMs. I'm not sure why you assume such an approach would expect users to build VMs from scratch. customizing a working vm is in principle no harder in principle than submitting a job. > Sorry, I'm not drinking the virtualization/cloud koolaid. I'd love to I should say here that I'm not either - it's just a minor extension of the existing tools. clearly more significant than grids and potentially more useful and general. From tjrc at sanger.ac.uk Thu Jan 28 09:34:25 2010 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Thu, 28 Jan 2010 17:34:25 +0000 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: <4B61BA02.9000303@sas.upenn.edu> References: <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com> <4B5F18BB.8000406@sas.upenn.edu> <68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com> <4B5F82B1.10805@sonsorol.org> <68A57CCFD4005646957BD2D18E60667B0F12AF1B@milexchmb1.mil.tagmclarengroup.com> <4B605959.2000607@sas.upenn.edu> <4B61BA02.9000303@sas.upenn.edu> Message-ID: On 28 Jan 2010, at 4:23 pm, Gavin Burris wrote: > Sorry, I'm not drinking the virtualization/cloud koolaid. I'd love to > have everything abstracted and easy to manage, but I find > standardizing > on an OS or two and keeping things as stock as possible is easier, and > cheaper to manage at this point. In my situation, virtualization just > adds complexity and has a price/performance penalty. For HPC, I think you're probably right at the moment. But for more run-of-the-mill servers, the cost/benefit and simplification is definitely good. 
I've got about 9 physical servers providing virtual machines for web servers, development nodes, infrastructure services (mail, and so on) and all that stuff. Those 9 servers are running 170 virtual machines, which in the old days would mostly have been separate boxes, and in many cases redundant pairs for failover. OK, so the servers are meatier, and cost maybe four times what the basic tin we'd have used would have cost for a single server. I still make that about 75% less money on hardware than we would otherwise have spent for that number of services. The saving is much larger than what it cost to buy vSphere. Power consumption is more like a 90% saving - my entire virtualisation setup consumes about 3.6kW (not counting the storage), which is, what, about 20W per VM (and we're not full yet). And it all sits in 9U of rack space. And then there are all the fringe benefits which save me time; simplified storage allocation, reduced deployment time, almost complete elimination of service downtime for hardware maintenance, guest OS patch management and automated remediation (for Windows, SLES and RHAS anyway - most of our machines run Debian which sadly they don't do patch management for). I get HA for free, so I no longer have to fart about with heartbeat and redundant server pairs. I get lock-step fault tolerance for free, too, if I need it, so I can finally get rid of that Marathon abomination. Backups become simpler (meh, just back up the whole VM with Consolidated Backup). You're still right that the management of VM setup takes quite a lot of time, but it's a lot less than if I were having to configure and deploy the same wide variety of services on physical hardware. But Cloud stuff, I'm right with you and slightly skeptical at the moment. Especially for our extremely data-heavy CPU-lite applications. It's more likely to have application in our line of work for the ability to ship arbitrary untrusted code to data. My dream world is for all the sequencing sites to present their data to a cloud interface in a consistent manner, and if I want to analyse, say, Broad's data, I just ship my VM to them and run my analysis there. Similarly, we provide hosts for running their VMs. No more shipping disks around by Fedex, which is what the scientists currently do. But it's probably never going to happen. *sigh*. Far too much "not invented here" politics. Regards, Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From jlforrest at berkeley.edu Thu Jan 28 09:38:14 2010 From: jlforrest at berkeley.edu (Jon Forrest) Date: Thu, 28 Jan 2010 09:38:14 -0800 Subject: [Beowulf] GPU Beowulf Clusters Message-ID: <4B61CB86.9060105@berkeley.edu> I'm about to spend ~$20K on a new cluster that will be a proof-of-concept for doing GPU-based computing in one of the research groups here. A GPU cluster is different from a traditional HPC cluster in several ways: 1) The CPU speed and number of cores are not that important because most of the computing will be done inside the GPU. 2) Serious GPU boards are large enough that they don't easily fit into standard 1U pizza boxes. Plus, they require more power than the standard power supplies in such boxes can provide. I'm not familiar with the boxes that therefore should be used in a GPU cluster. 
3) Ideally, I'd like to put more than one GPU card in each compute node, but then I hit the issues in #2 even harder.

4) Assuming that a GPU can't be "time shared", this means that I'll have to set up my batch engine to treat the GPU as a non-sharable resource. This means that I'll only be able to run as many jobs on a compute node as I have GPUs. This also means that it would be wasteful to put CPUs in a compute node with more cores than the number of GPUs in the node. (This is assuming that the jobs don't do anything parallel on the CPUs - only on the GPUs.) Even if GPUs can be time shared, given the expense of copying between main memory and GPU memory, sharing GPUs among several processes will degrade performance.

Are there any other issues I'm leaving out?

Cordially,
-- Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA 94720-1460
510-643-1032
jlforrest at berkeley.edu

From eagles051387 at gmail.com Thu Jan 28 09:50:16 2010
From: eagles051387 at gmail.com (Jonathan Aquilina)
Date: Thu, 28 Jan 2010 18:50:16 +0100
Subject: [Beowulf] GPU Beowulf Clusters
In-Reply-To: <4B61CB86.9060105@berkeley.edu>
References: <4B61CB86.9060105@berkeley.edu>
Message-ID:

Are you going for the Nvidia Teslas, or are you looking to squeeze four cards into one box? Getting them powered shouldn't be a problem if you plan on using plain custom-built desktops; there are 2000W PSUs out there, if not more, nowadays. I'm not sure, though, whether you can quad-SLI the Teslas, and whether SLI would make any difference in regards to GPU clustered computing.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mdidomenico4 at gmail.com Thu Jan 28 09:53:42 2010
From: mdidomenico4 at gmail.com (Michael Di Domenico)
Date: Thu, 28 Jan 2010 12:53:42 -0500
Subject: [Beowulf] GPU Beowulf Clusters
In-Reply-To: <4B61CB86.9060105@berkeley.edu>
References: <4B61CB86.9060105@berkeley.edu>
Message-ID:

The way I do it is, but your mileage may vary...

We allocate two CPUs per GPU and use the Nvidia Tesla S1070 1U chassis product, so a standard quad-core, dual-socket server with four GPUs attached.

We've found that even though you expect the GPU to do most of the work, it really takes a CPU to drive the GPU and keep it busy. Having a second CPU to pre-stage/post-stage the memory has worked pretty well also.

For scheduling, we use SLURM and allocate one entire node per job, no sharing.

On Thu, Jan 28, 2010 at 12:38 PM, Jon Forrest wrote:
> I'm about to spend ~$20K on a new cluster that will be a proof-of-concept for doing GPU-based computing in one of the research groups here.
> A GPU cluster is different from a traditional HPC cluster in several ways:
> 1) The CPU speed and number of cores are not that important because most of the computing will be done inside the GPU.
> 2) Serious GPU boards are large enough that they don't easily fit into standard 1U pizza boxes. Plus, they require more power than the standard power supplies in such boxes can provide. I'm not familiar with the boxes that therefore should be used in a GPU cluster.
> 3) Ideally, I'd like to put more than one GPU card in each compute node, but then I hit the issues in #2 even harder.
> 4) Assuming that a GPU can't be "time shared", this means that I'll have to set up my batch engine to treat the GPU as a non-sharable resource. This means that I'll only be able to run as many jobs on a compute node as I have GPUs.
This also means > that it would be wasteful to put CPUs in a compute > node with more cores than the number GPUs in the > node. (This is assuming that the jobs don't do > anything parallel on the CPUs - only on the GPUs). > Even if GPUs can be time shared, given the expense > of copying between main memory and GPU memory, > sharing GPUs among several processes will degrade > performance. > > Are there any other issues I'm leaving out? > > Cordially, > -- > Jon Forrest > Research Computing Support > College of Chemistry > 173 Tan Hall > University of California Berkeley > Berkeley, CA > 94720-1460 > 510-643-1032 > jlforrest at berkeley.edu > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From mathog at caltech.edu Thu Jan 28 12:11:54 2010 From: mathog at caltech.edu (David Mathog) Date: Thu, 28 Jan 2010 12:11:54 -0800 Subject: [Beowulf] Re: GPU Beowulf Clusters Message-ID: Jon Forrest wrote: > Are there any other issues I'm leaving out? Yes, the time and expense of rewriting your code from a CPU model to a GPU model, and the learning curve for picking up this new skill. (Unless you are lucky and somebody has already ported the software you use.) Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From dnlombar at ichips.intel.com Thu Jan 28 14:40:15 2010 From: dnlombar at ichips.intel.com (David N. Lombard) Date: Thu, 28 Jan 2010 14:40:15 -0800 Subject: [Beowulf] rhel page_size In-Reply-To: References: Message-ID: <20100128224015.GB6191@nlxdcldnl2.cl.intel.com> On Fri, Jan 22, 2010 at 11:33:08AM -0700, Michael Di Domenico wrote: > does anyone know if it's still possible to change the default > page_size from 4k to something larger on RHEL v5 x86_64? > > My efforts to recompile the kernel with a larger page size are failing > me, but i might be doing it wrong... Google for HugeTLBfs, it will be much easier than changing the kernel's page size. -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. From peter.st.john at gmail.com Fri Jan 29 07:51:21 2010 From: peter.st.john at gmail.com (Peter St. John) Date: Fri, 29 Jan 2010 10:51:21 -0500 Subject: [Beowulf] rhel page_size In-Reply-To: References: Message-ID: I asked a guy at Red Hat, who asked a guy....(reminding me of the chain of Djinn in "Godel Escher Bach") and got this reply: "Hugepages are enabled on RHEL5 by default, so he's welcome to use those. Is there any particular reason he's not using those, and trying a recompile instead?" Peter On Fri, Jan 22, 2010 at 1:33 PM, Michael Di Domenico wrote: > does anyone know if it's still possible to change the default > page_size from 4k to something larger on RHEL v5 x86_64? > > My efforts to recompile the kernel with a larger page size are failing > me, but i might be doing it wrong... > > thanks > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ashley at pittman.co.uk Fri Jan 29 13:55:04 2010 From: ashley at pittman.co.uk (Ashley Pittman) Date: Fri, 29 Jan 2010 21:55:04 +0000 Subject: [Beowulf] clustering using xen virtualized machines In-Reply-To: <4B5F4466.9020403@hope.edu> References: <4B5F4466.9020403@hope.edu> Message-ID: <0517518E-4DD3-4F0C-A75F-81A42206F0F4@pittman.co.uk> On 26 Jan 2010, at 19:37, Paul Van Allsburg wrote: > Ashley Pittman wrote: >> On 25 Jan 2010, at 15:28, Jonathan Aquilina wrote: >>> has anyone tried clustering using xen based vm's. what is everyones take on that? its something that popped into my head while in my lectures today. >>> >> >> I've been using Amazon ec2 for clustering for months now, from a software perspective it's very similar to running real hardware. For my needs (development) it's perfectly adequate, I've not benchmarked it against running the same code on the raw hardware though. > > I'd love to try clustering on Amazon. It's really easy. > Is there a good writeup somewhere on how to configure & use mpi in the cloud? I'm not sure one is needed. As a bit of background I develop and support an open source debugging tool for parallel applications (see my sig for details), as such I run a lot of parallel apps but I run them purely to have something to test padb against hence I'm not bothered about performance, I just need a running job to interrogate. What is important for me (or rather my tool) is that it works in different environments so I run with a variety of clustering software. With Amazon I can boot any numbers of machine "instances" and pay $0.85c/h for each one, typically I run four at a time but I've run with up to twenty. Once the instances are booted there is no difference between using them and using real machines. I regularly use Slurm, OpenMPI (ORTE and under Slurm), MPICH2 (mpd, hydra and under slurm) and I've yet to find any way in which the setup differs from running on real metal. For persistent storage I pay for a 'EBS' volume which I attach to one vm and nfs export to the others which use as a shared /home, each instance also comes with a large scratch partition but I typically don't use this at all. I have a bunch of scripts for populating the hosts files and adding user accounts and that's all there is to it. For the EBS volume you simply pick the size you need, create the volume, attach it to a vm and them mkfs.ext3 as normal, this volume is persistent and is charged for by Gb by calendar month rather than instance hour. I can also choose what distro and indeed OS to run, the default is FC8 but it's easy enough to pick something else, I tend to flip between FC8, debian and Solaris every few weeks, this is mostly to ensure my code is well tested in different machines - it does mean re-compiling everything each time I switch which can take a while. I also noticed that over-committing virtual machines doesn't have the same negative impact as over-commiting the CPU's on virtual machines, sure the application performance plummets in either case but the virtual machine is still usable where as a real machine can stop responding almost completely. This means I can over-commit my vm's by running 32 procs per node and run 512 process jobs at a cost of only $1.36 an hour. Cheap enough to be able to try something, see if it works and not have to worry about the cost. In short, Amazon makes a really good development or test system for small scale clusters, it's good for testing code correctness and experimenting with different distos. 
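A rough sketch of that EBS-plus-NFS arrangement using the classic EC2 API tools; the volume and instance IDs, the zone and the subnet here are all placeholders:

# create a persistent 20 GB volume and attach it to the instance that will serve /home
ec2-create-volume -s 20 -z us-east-1a
ec2-attach-volume vol-12345678 -i i-87654321 -d /dev/sdf

# on that instance: filesystem, mount, and NFS-export as the shared /home
mkfs.ext3 /dev/sdf
mount /dev/sdf /home
echo '/home 10.0.0.0/24(rw,sync,no_root_squash)' >> /etc/exports
exportfs -ra

Since the volume is billed per GB per calendar month rather than per instance hour, it can persist as /home while the compute instances come and go.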
I'm not convinced about the performance, and I'm not convinced about the cost-effectiveness for larger or longer-running applications, but as a place to start it's ideal.

Ashley,

--

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

From jeff.johnson at aeoncomputing.com Thu Jan 28 12:27:06 2010
From: jeff.johnson at aeoncomputing.com (Jeff Johnson)
Date: Thu, 28 Jan 2010 12:27:06 -0800
Subject: [Beowulf] Re: GPU Beowulf Clusters (Jon Forrest)
In-Reply-To: <201001282000.o0SK087c015636@bluewest.scyld.com>
References: <201001282000.o0SK087c015636@bluewest.scyld.com>
Message-ID: <4B61F31A.7050109@aeoncomputing.com>

On 1/28/10 12:00 PM, Jon Forrest wrote:
> A GPU cluster is different from a traditional
> HPC cluster in several ways:
>
> 1) The CPU speed and number of cores are not that important because
> most of the computing will be done inside the GPU.

The GPU will be doing the specific operations called in the application, but you need enough CPU to handle the memory operations and PCI I/O that feed the GPU.

> 2) Serious GPU boards are large enough that they don't easily fit into
> standard 1U pizza boxes. Plus, they require more power than the
> standard power supplies in such boxes can provide. I'm not familiar
> with the boxes that therefore should be used in a GPU cluster.

There are 1U system designs that use a passively cooled GPU relying on the chassis cooling infrastructure, as the CPUs do, and they work well. They are matched with the correct power supply size to support the GPU as well as the CPUs, memory, disk, etc.

> 3) Ideally, I'd like to put more than one GPU card in each computer
> node, but then I hit the issues in #2 even harder.

Not in a 1U system, unless you use the nVidia S1070 external GPU chassis. Even then, if your application can be bottlenecked by having less than full PCIe x16 bandwidth to the GPUs, then the S1070 approach would be less than optimal compared to a system that has two dedicated, full-speed PCIe x16 slots.

--Jeff

--
------------------------------
Jeff Johnson
Manager
Aeon Computing

jeff.johnson at aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 f: 858-412-3845
m: 619-204-9061

4905 Morena Boulevard, Suite 1313 - San Diego, CA 92117

From tvsingh at ucla.edu Thu Jan 28 12:57:05 2010
From: tvsingh at ucla.edu (Singh, Tajendra)
Date: Thu, 28 Jan 2010 12:57:05 -0800
Subject: [Beowulf] GPU Beowulf Clusters
In-Reply-To: 
References: <4B61CB86.9060105@berkeley.edu>
Message-ID: <43F64E86355A744E9D51506B6C6783B906041C18@EM2.ad.ucla.edu>

This is not a problem in your setup, as you are assigning a whole node together. In general, how can one deal with the problem of binding a particular GPU device to the scheduler? Sorry if I am asking something which is already known and there are ways to bind the devices within the scheduler.

Thanks,
TV

-----Original Message-----
From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Michael Di Domenico
Sent: Thursday, January 28, 2010 9:54 AM
To: Beowulf Mailing List
Subject: Re: [Beowulf] GPU Beowulf Clusters

The way I do it is, but your mileage may vary...

We allocate two CPUs per GPU and use the Nvidia Tesla S1070 1U chassis product. So a standard quad-core, dual-socket server with four GPUs attached.

We've found that even though you expect the GPU to do most of the work, it really takes a CPU to drive the GPU and keep it busy. Having a second CPU to pre-stage/post-stage the memory has worked pretty well also.
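Roughly, the staging pattern looks like this (a hypothetical sketch, not from any real code base - kernel and buffer names are invented). Pinned memory via cudaMallocHost is what lets cudaMemcpyAsync actually overlap with CPU work; real code would double-buffer to overlap with the kernel as well:

    #include <string.h>
    #include <cuda_runtime.h>

    __global__ void process(float *d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;   /* stand-in for the real work */
    }

    void run_chunks(const float *src, int nchunks, int n)
    {
        float *h_buf, *d_buf;
        cudaStream_t stream;
        cudaMallocHost((void **)&h_buf, n * sizeof(float)); /* pinned */
        cudaMalloc((void **)&d_buf, n * sizeof(float));
        cudaStreamCreate(&stream);

        for (int i = 0; i < nchunks; i++) {
            /* one CPU core stages the next chunk into pinned memory... */
            memcpy(h_buf, src + (size_t)i * n, n * sizeof(float));
            /* ...then queues the copy and kernel and is free to do other work */
            cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float),
                            cudaMemcpyHostToDevice, stream);
            process<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
            cudaStreamSynchronize(stream);  /* wait before reusing h_buf */
        }
        cudaStreamDestroy(stream);
        cudaFree(d_buf);
        cudaFreeHost(h_buf);
    }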
For scheduling, we use SLURM and allocate one entire node per job, no sharing.

On Thu, Jan 28, 2010 at 12:38 PM, Jon Forrest wrote:
> I'm about to spend ~$20K on a new cluster
> that will be a proof-of-concept for doing
> GPU-based computing in one of the research
> groups here.
>
> A GPU cluster is different from a traditional
> HPC cluster in several ways:
>
> 1) The CPU speed and number of cores are not
> that important because most of the computing will
> be done inside the GPU.
>
> 2) Serious GPU boards are large enough that
> they don't easily fit into standard 1U pizza
> boxes. Plus, they require more power than the
> standard power supplies in such boxes can
> provide. I'm not familiar with the boxes
> that therefore should be used in a GPU cluster.
>
> 3) Ideally, I'd like to put more than one GPU
> card in each computer node, but then I hit the
> issues in #2 even harder.
>
> 4) Assuming that a GPU can't be "time shared",
> this means that I'll have to set up my batch
> engine to treat the GPU as a non-sharable resource.
> This means that I'll only be able to run as many
> jobs on a compute node as I have GPUs. This also means
> that it would be wasteful to put CPUs in a compute
> node with more cores than the number of GPUs in the
> node. (This is assuming that the jobs don't do
> anything parallel on the CPUs - only on the GPUs).
> Even if GPUs can be time shared, given the expense
> of copying between main memory and GPU memory,
> sharing GPUs among several processes will degrade
> performance.
>
> Are there any other issues I'm leaving out?
>
> Cordially,
> --
> Jon Forrest
> Research Computing Support
> College of Chemistry
> 173 Tan Hall
> University of California Berkeley
> Berkeley, CA
> 94720-1460
> 510-643-1032
> jlforrest at berkeley.edu
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
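For reference, a hypothetical submission script in the spirit of the one-job-per-node policy above. Nothing here is from the actual site configuration, and GPU-aware scheduling (GRES) came to SLURM later; in this era the node itself is the schedulable unit, so plain exclusive allocation does the job. The binary and input names are placeholders:

    #!/bin/sh
    #SBATCH --job-name=gpu-md
    #SBATCH --nodes=1
    #SBATCH --exclusive    # whole node: all GPUs and cores belong to this job
    srun ./md_gpu input.dat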
From eagles051387 at gmail.com Fri Jan 29 23:38:17 2010
From: eagles051387 at gmail.com (Jonathan Aquilina)
Date: Sat, 30 Jan 2010 08:38:17 +0100
Subject: [Beowulf] clustering using xen virtualized machines
In-Reply-To: <0517518E-4DD3-4F0C-A75F-81A42206F0F4@pittman.co.uk>
References: <4B5F4466.9020403@hope.edu> <0517518E-4DD3-4F0C-A75F-81A42206F0F4@pittman.co.uk>
Message-ID: 

Then why not just run VMs on the host? Also, in that case, would it be possible to point PXE at it and tell it, when booting the nodes, which image to use?

On Fri, Jan 29, 2010 at 10:55 PM, Ashley Pittman wrote:
>
> On 26 Jan 2010, at 19:37, Paul Van Allsburg wrote:
> > Ashley Pittman wrote:
> >> On 25 Jan 2010, at 15:28, Jonathan Aquilina wrote:
> >>> Has anyone tried clustering using Xen-based VMs? What is everyone's
> take on that? It's something that popped into my head while in my lectures
> today.
> >>
> >> I've been using Amazon EC2 for clustering for months now; from a
> software perspective it's very similar to running real hardware. For my
> needs (development) it's perfectly adequate, though I've not benchmarked it
> against running the same code on the raw hardware.
> >
> > I'd love to try clustering on Amazon.
>
> It's really easy.
>
> > Is there a good writeup somewhere on how to configure & use MPI in the
> cloud?
>
> I'm not sure one is needed. As a bit of background, I develop and support
> an open source debugging tool for parallel applications (see my sig for
> details); as such I run a lot of parallel apps, but I run them purely to have
> something to test padb against, hence I'm not bothered about performance - I
> just need a running job to interrogate. What is important for me (or rather
> my tool) is that it works in different environments, so I run with a variety
> of clustering software.
>
> With Amazon I can boot any number of machine "instances" and pay $0.085/hour
> for each one; typically I run four at a time, but I've run with up to twenty.
> Once the instances are booted there is no difference between using them and
> using real machines. I regularly use Slurm, OpenMPI (ORTE and under Slurm),
> and MPICH2 (mpd, hydra and under Slurm), and I've yet to find any way in which
> the setup differs from running on real metal. For persistent storage I pay
> for an 'EBS' volume which I attach to one VM and NFS-export to the others,
> which use it as a shared /home; each instance also comes with a large scratch
> partition, but I typically don't use this at all. I have a bunch of scripts
> for populating the hosts files and adding user accounts, and that's all there
> is to it. For the EBS volume you simply pick the size you need, create the
> volume, attach it to a VM and then mkfs.ext3 as normal; this volume is
> persistent and is charged per GB per calendar month rather than per instance
> hour.
>
> I can also choose what distro and indeed OS to run; the default is FC8 but
> it's easy enough to pick something else. I tend to flip between FC8, Debian
> and Solaris every few weeks, mostly to ensure my code is well tested on
> different machines - it does mean re-compiling everything each time I
> switch, which can take a while.
>
> I also noticed that over-committing the CPUs on virtual machines doesn't have
> the same negative impact as over-committing the CPUs on real machines; sure,
> the application performance plummets in either case, but the virtual machine
> is still usable, whereas a real machine can stop responding almost completely.
> This means I can over-commit my VMs by running 32 procs per node and run
> 512-process jobs at a cost of only $1.36 an hour. Cheap enough to be able
> to try something, see if it works, and not have to worry about the cost.
>
> In short, Amazon makes a really good development or test system for small-
> scale clusters; it's good for testing code correctness and experimenting
> with different distros. I'm not convinced about the performance and I'm not
> convinced about the cost-effectiveness for larger or longer-running
> applications, but as a place to start it's ideal.
>
> Ashley,
>
> --
>
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf

--
Jonathan Aquilina
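On the PXE question above: a hypothetical pxelinux.cfg stanza for network-booting nodes into a Xen dom0 image. All paths and image names are invented; syslinux's mboot.c32 module is what chain-loads the hypervisor, kernel and initrd together:

    # /tftpboot/pxelinux.cfg/default -- everything here is illustrative
    DEFAULT xen-node
    LABEL xen-node
      KERNEL mboot.c32
      APPEND xen.gz dom0_mem=512M --- vmlinuz-xen console=tty0 --- initrd-xen.img

pxelinux also looks for per-node files named after the MAC address (pxelinux.cfg/01-aa-bb-cc-dd-ee-ff), which is the usual way to hand different nodes different images.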
From michf at post.tau.ac.il Sat Jan 30 04:31:45 2010
From: michf at post.tau.ac.il (Micha Feigin)
Date: Sat, 30 Jan 2010 14:31:45 +0200
Subject: [Beowulf] GPU Beowulf Clusters
In-Reply-To: <4B61CB86.9060105@berkeley.edu>
References: <4B61CB86.9060105@berkeley.edu>
Message-ID: <20100130143145.43de588a@vivalunalitshi.luna.local>

On Thu, 28 Jan 2010 09:38:14 -0800
Jon Forrest wrote:

> I'm about to spend ~$20K on a new cluster
> that will be a proof-of-concept for doing
> GPU-based computing in one of the research
> groups here.
>
> A GPU cluster is different from a traditional
> HPC cluster in several ways:
>
> 1) The CPU speed and number of cores are not
> that important because most of the computing will
> be done inside the GPU.
>

The speed not so much, but the number of cores does matter. You should have at least one core per GPU, as the CPU is in charge of scheduling and initiating memory transfers (and, if not set up for DMA, also handling the memory transfer itself). When latency is an issue (especially for jobs with a lot of CPU-side scheduling), the CPU polls the GPU for results, which can bump CPU usage. Nehalem raises another issue: there is no separate northbridge bus, and memory access goes via the CPU.

It is recommended, BTW, that you have at least the same amount of system memory as GPU memory, so with Tesla it is 4GB per GPU.

> 2) Serious GPU boards are large enough that
> they don't easily fit into standard 1U pizza
> boxes. Plus, they require more power than the
> standard power supplies in such boxes can
> provide. I'm not familiar with the boxes
> that therefore should be used in a GPU cluster.
>

You use dedicated systems. Either one 1U pizza box for the CPU and a matched 1U Tesla S1070 pizza box, which has 4 Tesla GPUs:
http://www.nvidia.com/object/product_tesla_s1070_us.html
or there are several vendors out there that match two Tesla GPUs (usually the Tesla M1060 in this case, which is a passively cooled version of the C1060)
http://www.nvidia.com/object/product_tesla_m1060_us.html
to a dual-CPU Xeon in a 1U system. You can start here (the links page from NVidia):
http://www.nvidia.com/object/tesla_preconfigured_clusters_wtb.html

There are other specialized options if you want, but most of them are aimed at higher-budget clusters.

Power-wise, each Tesla takes 160W; adding what the CPU and the rest of the system require, a 1000W power supply should do. The S1070 comes with a 1200W power supply on board.

> 3) Ideally, I'd like to put more than one GPU
> card in each computer node, but then I hit the
> issues in #2 even harder.
>

You are looking for the Tesla S1070 or the previously mentioned solutions.

> 4) Assuming that a GPU can't be "time shared",
> this means that I'll have to set up my batch
> engine to treat the GPU as a non-sharable resource.
> This means that I'll only be able to run as many
> jobs on a compute node as I have GPUs. This also means
> that it would be wasteful to put CPUs in a compute
> node with more cores than the number of GPUs in the
> node. (This is assuming that the jobs don't do
> anything parallel on the CPUs - only on the GPUs).
> Even if GPUs can be time shared, given the expense
> of copying between main memory and GPU memory,
> sharing GPUs among several processes will degrade
> performance.
>

The GPU doesn't have a swap-in/swap-out mechanism, so the only way it can time-share is by alternating kernels, as long as there is enough memory. This shouldn't be done for HPC (same as with the CPU, by the way, due to NUMA/L2-cache and context-switching issues).

What you would want to do is set up the cards in compute-exclusive mode and then tell the users not to choose a card explicitly. The context-creation function will then choose the next available card automatically. With the Tesla S1070 you would then set up the machine as having 4 schedulable slots, one per GPU. The processes will be sharing the PCI bus for communications, though, so you may prefer to set up the system as 1 job per machine, or at least use a round-robin scheduler.
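A sketch of the user side of this, assuming the admin has put the cards in compute-exclusive mode with nvidia-smi. The point above is that with no explicit cudaSetDevice the runtime can land on a free card by itself; a paranoid variant probes each device in order, since context creation on a busy exclusive-mode card simply fails (behavior as I understand the CUDA 2.x runtime; the helper name is invented):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int grab_free_gpu(void)
    {
        int n = 0;
        cudaGetDeviceCount(&n);
        for (int dev = 0; dev < n; dev++) {
            cudaSetDevice(dev);
            if (cudaFree(0) == cudaSuccess) {  /* forces context creation */
                printf("running on GPU %d\n", dev);
                return dev;
            }
            cudaThreadExit();  /* clear any half-created context (CUDA 2.x API) */
        }
        return -1;  /* all GPUs busy */
    }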
> Are there any other issues I'm leaving out?
>

Take note that the S1070 is ~$6K, so you are talking about at most two to three machines here with your budget. Also, don't even think about putting that S1070 anywhere but a server room, or at least nowhere with users nearby, as it makes a lot of noise.

> Cordially,

From gerry.creager at tamu.edu Sat Jan 30 08:46:50 2010
From: gerry.creager at tamu.edu (Gerry Creager)
Date: Sat, 30 Jan 2010 10:46:50 -0600
Subject: [Beowulf] clustering using xen virtualized machines
In-Reply-To: 
References: <68A57CCFD4005646957BD2D18E60667B0F0BB6EE@milexchmb1.mil.tagmclarengroup.com> <4B5F18BB.8000406@sas.upenn.edu> <68A57CCFD4005646957BD2D18E60667B0F0BB821@milexchmb1.mil.tagmclarengroup.com> <4B5F82B1.10805@sonsorol.org> <68A57CCFD4005646957BD2D18E60667B0F12AF1B@milexchmb1.mil.tagmclarengroup.com> <4B605959.2000607@sas.upenn.edu> 
Message-ID: <4B64627A.4010409@tamu.edu>

Mark Hahn wrote:
>> I don't buy the argument that the winning case is packaging up a VM with
>> all your software. If you really are unable to build the required
>> software stack for a given cluster and its OS, I think using something
>
> you're right, but only for narrow-function clusters. suppose you have a
> cluster used by 2k users across a handful of different universities
> and 100 departments. and have, let's say, 2 staff. it's conceivable
> that using VMs would permit a higher level of service by putting more
> configuration flexibility into the hands of the users. yes, most would
> use a standard image (which might be the bare-metal one, actually),
> but making it easier to accommodate variance is valuable.
>
> it even offers the ability to shift the model - instead of actually
> booting VMs on nodes for a job, how about just resurrecting a number
> of VM instances (freeze-dried in already-booted state)? that makes the
> setup latency potentially much lower. (pages from a VM image can
> be fetched lazily afaik, and presumably also COW.)
>
> for the few HPC-oriented performance studies of VMs I've seen,
> the only slowdowns were for OS activity (IO, page allocation, etc).
> an ideally-behaved HPC app minimizes those already, so...

Coming in a bit late, but I have one minor quibble, Mark: VM network latency. I've seen this be a bottleneck for some of our (non-HPC) VMs on decent hardware and network gear, at least with Xen, and VMWare before it.

The attractive part of the VM picture is what you stated, though, where the user takes the onus of managing their own image with software. I'm considering this for an urgent-computing model I want to field...
gerry

From jlforrest at berkeley.edu Sat Jan 30 10:24:09 2010
From: jlforrest at berkeley.edu (Jon Forrest)
Date: Sat, 30 Jan 2010 10:24:09 -0800
Subject: [Beowulf] GPU Beowulf Clusters
In-Reply-To: <20100130143145.43de588a@vivalunalitshi.luna.local>
References: <4B61CB86.9060105@berkeley.edu> <20100130143145.43de588a@vivalunalitshi.luna.local>
Message-ID: <4B647949.9030700@berkeley.edu>

On 1/30/2010 4:31 AM, Micha Feigin wrote:

> It is recommended, BTW, that you have at least the same amount of system memory
> as GPU memory, so with Tesla it is 4GB per GPU.

I'm not going to get Teslas, for several reasons:

1) This is a proof-of-concept cluster. Spending $1200 per graphics card means that the GPUs alone, assuming 2 GPUs, would cost as much as a whole node with 2 consumer-grade cards. (See below)

2) We know that the Fermi cards are coming out soon. If we were going to spend big bucks on GPUs, we'd wait for them. But our funding runs out before the Fermis will be available. This is too bad, but there's nothing I can do about it.

See below for comments regarding CPUs and cores.

> You use dedicated systems. Either one 1U pizza box for the CPU and a matched 1U
> Tesla S1070 pizza box, which has 4 Tesla GPUs

Since my first post I've learned about the Supermicro boxes that have space for two GPUs (http://www.supermicro.com/products/system/1U/6016/SYS-6016GT-TF.cfm?GPU=). This looks like a good way to go for a proof-of-concept cluster. Plus, since we have to pay $10/U/month at the Data Center, it's a good way to use space.

The GPU that looks the most promising is the GeForce GTX275 (http://www.evga.com/products/moreInfo.asp?pn=017-P3-1175-AR). It has 1792MB of RAM and is only ~$300. I realize that there are better cards, but for this proof-of-concept cluster we want to get the best bang for the buck. Later, after we've ported our programs and have some experience optimizing them, we'll consider something better, probably using whatever the best Fermi-based card is.

The research group that will be purchasing this cluster does molecular dynamics simulations that usually take 24 hours or more to complete using quad-core Xeons. We hope to bring down this time substantially.

> The GPU doesn't have a swap-in/swap-out mechanism, so the only way it can time-share is
> by alternating kernels, as long as there is enough memory. This shouldn't be done for
> HPC (same as with the CPU, by the way, due to NUMA/L2-cache and context-switching
> issues).

Right. So this means 4 cores should be good enough for 2 GPUs. I wish somebody made a motherboard that would allow 6-core AMD Istanbuls, but they don't. Putting two 4-core CPUs on the motherboard might not be worth the cost. I'm not sure.

> The processes will be sharing the PCI bus for communications, though, so you may
> prefer to set up the system as 1 job per machine, or at least use a round-robin
> scheduler.

This is another reason not to go crazy with lots of cores. They'll be sitting idle most of the time, unless I also create queues for normal non-GPU jobs.

> Take note that the S1070 is ~$6K, so you are talking about at most two to three
> machines here with your budget.

Ha, ha!! ~$6K should get me two compute nodes, complete with graphics cards.

I appreciate everyone's comments, and I welcome more.
Cordially,
--
Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA
94720-1460
510-643-1032
jlforrest at berkeley.edu

From jlforrest at berkeley.edu Sat Jan 30 17:30:31 2010
From: jlforrest at berkeley.edu (Jon Forrest)
Date: Sat, 30 Jan 2010 17:30:31 -0800
Subject: [Beowulf] GPU Beowulf Clusters
In-Reply-To: <4B64B83B.9080303@pathscale.com>
References: <4B61CB86.9060105@berkeley.edu> <20100130143145.43de588a@vivalunalitshi.luna.local> <4B647949.9030700@berkeley.edu> <4B64B83B.9080303@pathscale.com>
Message-ID: <4B64DD37.50201@berkeley.edu>

On 1/30/2010 2:52 PM, "C. Bergström" wrote:

> Hi Jon,
>
> I must emphasize what David Mathog said about the importance of the gpu
> programming model.

I don't doubt this at all. Fortunately, we have lots of very smart people here at UC Berkeley. I have the utmost confidence that they will figure this stuff out. My job is to purchase and configure the cluster.

> My perspective (with hopefully not too much opinion added)
> OpenCL vs CUDA - OpenCL is 1/10th as popular, lacks features, is more
> tedious to write, and in an effort to stay generic loses the potential to
> fully exploit the gpu. At one point the performance of the drivers from
> Nvidia was not equivalent, but I think that's been fixed. (This does not
> mean all vendors are unilaterally doing a good job)

This is very interesting news. As far as I know, nobody is doing anything with OpenCL in the College of Chemistry around here. On the other hand, we've been following all the press about how it's going to be the great unifier, so that it won't be necessary to use a proprietary API such as CUDA anymore. At this point it's too early to do anything with OpenCL until our colleagues in the Computer Science department have made a pass at it and have experiences to talk about.

> Have you considered sharing access with another research lab that has
> already purchased something similar?
> (Some vendors may also be willing to let you run your codes in exchange
> for feedback.)

There's nobody else at UC Berkeley I know of who has a GPU cluster.

I don't know of any vendor who'd be willing to volunteer their cluster. If anybody would like to volunteer, step right up.

> 1) sw thread synchronization chews up processor time

Right, but let's say right now 80% of the CPU time is spent in routines that will eventually be done in the GPU (I'm just making this number up). I don't see how having a faster CPU would help overall. (Amdahl's law, in reverse: once 80% of the time is on the GPU, a faster CPU can only ever shave the remaining 20%, i.e. less than a 1.25x overall gain.)

> 2) Do you already know if your code has enough computational complexity
> to outweigh the memory access costs?

In general, yes. A couple of grad students have ported some of their code to CUDA with excellent results. Plus, molecular dynamics is well suited to GPU programming, or so I'm told. Several of the popular open source MD packages have already been ported, also with excellent results.

> 3) Do you know if the GTX275 has enough vram? Your benchmarks will
> suffer if you start going to GART and page faulting

The one I mentioned in my posting has 1.8GB of RAM. If this isn't enough then we're in trouble. The grad student I mentioned has been using the 898MB version of this card without problems.

> 4) I can tell you 100% that not all gpu are created equally when it
> comes to handling cuda code. I don't have experience with the GTX275,
> but if you do hit issues I would be curious to hear about them.

I've heard that it's much better than the 9500GT that we first started using. Since the 9500GT is a much cheaper card we didn't expect much performance out of it, but the grad student who was trying to use it said that there were problems with it not releasing memory, resulting in having to reboot the host. I don't know the details.

> Some questions in return..
> Is your code currently C, C++ or Fortran?

The most important program for this group is in Fortran. We're going to keep it in Fortran, but we're going to write C interfaces to the routines that will run on the GPU, and then write these routines in C.
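Roughly, the plumbing would look like the skeleton below - all names are invented for illustration, and it assumes the common g77/gfortran conventions (arguments passed by reference, a trailing underscore on external names); the .cu file is compiled with nvcc and linked into the Fortran program:

    // force_gpu.cu -- hypothetical C wrapper callable from Fortran
    #include <cuda_runtime.h>

    __global__ void force_kernel(const float *x, float *f, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            f[i] = -x[i];   /* stand-in for the real force expression */
    }

    extern "C" void force_gpu_(const float *x, float *f, const int *n)
    {
        size_t bytes = *n * sizeof(float);
        float *d_x, *d_f;
        cudaMalloc((void **)&d_x, bytes);
        cudaMalloc((void **)&d_f, bytes);
        cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
        force_kernel<<<(*n + 255) / 256, 256>>>(d_x, d_f, *n);
        cudaMemcpy(f, d_f, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d_x);
        cudaFree(d_f);
    }

From Fortran this would be called as "call force_gpu(x, f, n)".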
> Is there any interest in optimizations at the compiler level which could
> benefit molecular dynamics simulations?

Of course, but at what price? I'm talking about both the price in dollars and the price in non-standard directives.

I'm not a chemist, so I don't know what would speed up MD calculations more than a good GPU.

Cordially,
--
Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA
94720-1460
510-643-1032
jlforrest at berkeley.edu

From landman at scalableinformatics.com Sat Jan 30 21:38:24 2010
From: landman at scalableinformatics.com (Joe Landman)
Date: Sun, 31 Jan 2010 00:38:24 -0500
Subject: [Beowulf] Anyone with really large clusters seeing memory leaks with OFED 1.5 for tcp based apps?
Message-ID: <4B651750.8080307@scalableinformatics.com>

Hi folks

Trying to trace something annoying down, and see if we are running into something that is known.

OFED 1.5 on a 2.6.30.10 kernel. Running a file system atop IPoIB (many reasons, none I care to get into here at the moment). Under light load, the file system gradually grabs memory. Possibly a leak, not entirely sure. Could be the OFED stack underneath.

The backing file system is xfs. It has been (on this hardware in other situations) rock solid stable. Here, xfs and OFED/IPoIB all toss their cookies (and fail allocations) under moderate to heavy load. Working with the file system vendor on this. I am not sure we have the answer nailed, so I wanted to see who out there is running a big ( >512 nodes) cluster, doing large data transfers (preferably over IPoIB) for data storage, and running a late-model OFED.

If you fall into this category, please let me know, as I'd like to ask a few questions offline about any observed OFED/IPoIB failure modes. I am not convinced it is OFED/IPoIB, but I'd like to see what other people have run into ... if anything.

Thanks!

Joe

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
       http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

From michf at post.tau.ac.il Sun Jan 31 09:06:48 2010
From: michf at post.tau.ac.il (Micha Feigin)
Date: Sun, 31 Jan 2010 19:06:48 +0200
Subject: [Beowulf] GPU Beowulf Clusters
In-Reply-To: <4B647949.9030700@berkeley.edu>
References: <4B61CB86.9060105@berkeley.edu> <20100130143145.43de588a@vivalunalitshi.luna.local> <4B647949.9030700@berkeley.edu>
Message-ID: <20100131190648.6d91b2f7@vivalunalitshi.luna.local>

On Sat, 30 Jan 2010 10:24:09 -0800
Jon Forrest wrote:

> On 1/30/2010 4:31 AM, Micha Feigin wrote:
>
> > It is recommended, BTW, that you have at least the same amount of system memory
> > as GPU memory, so with Tesla it is 4GB per GPU.
>
> I'm not going to get Teslas, for several reasons:
>
> 1) This is a proof-of-concept cluster.
> Spending $1200
> per graphics card means that the GPUs alone, assuming
> 2 GPUs, would cost as much as a whole node with
> 2 consumer-grade cards. (See below)
>

Be very, very sure that consumer GeForces can go in 1U boxes. It's not so much the space as that I'm skeptical of their ability to handle the thermal issues. They are just not designed for this kind of work.

Note that GeForces are overclocked (my GTX 285 by 30% compared to a Tesla with the same chip) and are actively cooled, which means that you need to get air flowing into the side fan. That's exactly why they put the Tesla M and not the C into those boxes.

The GeForce driver also throttles the card under load to solve thermal issues. You will probably want to underclock the cards to the Tesla spec and be sure to monitor the thermal state.

I know someone who works with 3 GTX295s in a desktop box, and he initially had some thermal shutdown issues with older drivers. I'm guessing that the newer drivers just throttle the cards more aggressively under load.

> 2) We know that the Fermi cards are coming out
> soon. If we were going to spend big bucks
> on GPUs, we'd wait for them. But our funding
> runs out before the Fermis will be available.
> This is too bad, but there's nothing I can do
> about it.
>

Check out the Mad Science program; it's supposed to end today, but maybe if you talk to NVidia they can still get you into it (they are rather flexible, especially with universities, and they also offer it for companies):
http://www.nvidia.com/object/mad_science_promo.html
You can buy a current Tesla (T10 core) and upgrade it to a Fermi (T20 core) when it comes out, for the cost difference. It may be more cost-effective if you plan to build a Fermi cluster later on. It is designed to upgrade within the same line, though (C, M or S), so you may want to consider now which one to go with.

> See below for comments regarding CPUs and cores.
>
> > You use dedicated systems. Either one 1U pizza box for the CPU and a matched 1U
> > Tesla S1070 pizza box, which has 4 Tesla GPUs
>
> Since my first post I've learned about the Supermicro boxes
> that have space for two GPUs
> (http://www.supermicro.com/products/system/1U/6016/SYS-6016GT-TF.cfm?GPU=).
> This looks like a good way to go for a proof-of-concept cluster. Plus,
> since we have to pay $10/U/month at the Data Center, it's a good
> way to use space.
>

See my previous comment.

> The GPU that looks the most promising is the GeForce GTX275
> (http://www.evga.com/products/moreInfo.asp?pn=017-P3-1175-AR).
> It has 1792MB of RAM and is only ~$300. I realize that there
> are better cards, but for this proof-of-concept cluster we
> want to get the best bang for the buck. Later, after we've
> ported our programs and have some experience optimizing them,
> we'll consider something better, probably using whatever
> the best Fermi-based card is.
>
> The research group that will be purchasing this cluster does
> molecular dynamics simulations that usually take 24 hours or more
> to complete using quad-core Xeons. We hope to bring down this
> time substantially.
>
> > The GPU doesn't have a swap-in/swap-out mechanism, so the only way it can time-share is
> > by alternating kernels, as long as there is enough memory. This shouldn't be done for
> > HPC (same as with the CPU, by the way, due to NUMA/L2-cache and context-switching
> > issues).
>
> Right. So this means 4 cores should be good enough for 2 GPUs.
> I wish somebody made a motherboard that would allow 6-core
> AMD Istanbuls, but they don't.
> Putting two 4-core CPUs on the
> motherboard might not be worth the cost. I'm not sure.
>
> > The processes will be sharing the PCI bus for communications, though, so you may
> > prefer to set up the system as 1 job per machine, or at least use a round-robin
> > scheduler.
>
> This is another reason not to go crazy with lots of cores.
> They'll be sitting idle most of the time, unless I also
> create queues for normal non-GPU jobs.
>
> > Take note that the S1070 is ~$6K, so you are talking about at most two to three
> > machines here with your budget.
>
> Ha, ha!! ~$6K should get me two compute nodes, complete
> with graphics cards.
>
> I appreciate everyone's comments, and I welcome more.
>
> Cordially,

From michf at post.tau.ac.il Sun Jan 31 09:33:58 2010
From: michf at post.tau.ac.il (Micha Feigin)
Date: Sun, 31 Jan 2010 19:33:58 +0200
Subject: [Beowulf] GPU Beowulf Clusters
In-Reply-To: <4B64DD37.50201@berkeley.edu>
References: <4B61CB86.9060105@berkeley.edu> <20100130143145.43de588a@vivalunalitshi.luna.local> <4B647949.9030700@berkeley.edu> <4B64B83B.9080303@pathscale.com> <4B64DD37.50201@berkeley.edu>
Message-ID: <20100131193358.395acc13@vivalunalitshi.luna.local>

On Sat, 30 Jan 2010 17:30:31 -0800
Jon Forrest wrote:

> On 1/30/2010 2:52 PM, "C. Bergström" wrote:
>
> > Hi Jon,
> >
> > I must emphasize what David Mathog said about the importance of the gpu
> > programming model.
>
> I don't doubt this at all. Fortunately, we have lots
> of very smart people here at UC Berkeley. I have
> the utmost confidence that they will figure this
> stuff out. My job is to purchase and configure the
> cluster.
>
> > My perspective (with hopefully not too much opinion added)
> > OpenCL vs CUDA - OpenCL is 1/10th as popular, lacks features, is more
> > tedious to write, and in an effort to stay generic loses the potential to
> > fully exploit the gpu. At one point the performance of the drivers from
> > Nvidia was not equivalent, but I think that's been fixed. (This does not
> > mean all vendors are unilaterally doing a good job)
>
> This is very interesting news. As far as I know, nobody is doing
> anything with OpenCL in the College of Chemistry around here.
> On the other hand, we've been following all the press about how
> it's going to be the great unifier, so that it won't be necessary
> to use a proprietary API such as CUDA anymore. At this point it's too
> early to do anything with OpenCL until our colleagues in
> the Computer Science department have made a pass at it and
> have experiences to talk about.
>

People are starting to work with OpenCL, but I don't think that it's ready yet. The NVidia implementation is still buggy and not up to par against Cuda in terms of performance. Code is longer and more tedious (it mostly matches the NVidia driver model instead of the much easier-to-use C API). I know that although NVidia say they fully support it, they don't like it too much. NVidia techs told me that the performance difference can be about 1:2.

Cuda has existed for 5 years (and another 2 internally at NVidia). Version 1 of OpenCL was released December 2008, and they started working on 1.1 immediately after that. It has also been broken almost from the start due to too many companies controlling it (it's designed by a consortium) and trying to solve the problem for too many scenarios at the same time.

ATI also started supporting OpenCL, but I don't have any experience with that. Their upside is that it also allows compiling CPU versions.
I would start with Cuda, as the move to OpenCL is very simple afterwards if you wish, and Cuda is easier to start with.

Also note that OpenCL gives you functional portability but not performance portability. You will not write the same OpenCL code for NVidia, ATI, CPUs, etc. The vectorization should be all different (NVidia discourages vectorization, ATI requires vectorization, SSE requires different vectorization), the memory model is different, the size of the work groups should be different, etc.

> > Have you considered sharing access with another research lab that has
> > already purchased something similar?
> > (Some vendors may also be willing to let you run your codes in exchange
> > for feedback.)
>
> There's nobody else at UC Berkeley I know of who has a GPU
> cluster.
>
> I don't know of any vendor who'd be willing to volunteer
> their cluster. If anybody would like to volunteer, step
> right up.
>

Are you aware of the NVidia professor partnership program? We got a Tesla S1070 for free from them.
http://www.nvidia.com/page/professor_partnership.html

> > 1) sw thread synchronization chews up processor time
>
> Right, but let's say right now 80% of the CPU time is spent
> in routines that will eventually be done in the GPU (I'm
> just making this number up). I don't see how having a faster
> CPU would help overall.
>

My experience is that unless you wish to write hybrid code (code that partly runs on the GPU and partly on the CPU, in parallel, to fully utilize the system), you don't need to care too much about the CPU power. Note that the Cuda model is asynchronous, so you can run code in parallel between the GPU and CPU.

> > 2) Do you already know if your code has enough computational complexity
> > to outweigh the memory access costs?
>
> In general, yes. A couple of grad students have ported some
> of their code to CUDA with excellent results. Plus, molecular
> dynamics is well suited to GPU programming, or so I'm told.
> Several of the popular open source MD packages have already
> been ported, also with excellent results.
>

The issue is not only computational complexity but also regular memory accesses. Random memory accesses on the GPU can seriously kill your performance.

Also note that until Fermi comes out, the double-precision performance is horrible. If you can't use single precision, then GPUs are probably not for you at the moment. Double precision on the G200 is around 1/8 of single precision, and the G80/G90 don't have double precision at all. Fermi improves that by finally providing double precision running at 1/2 the single-precision speed (basically combining two FPUs into one double-precision unit).

> > 3) Do you know if the GTX275 has enough vram? Your benchmarks will
> > suffer if you start going to GART and page faulting
>

You don't have page faulting on the GPU; GPUs don't have virtual memory. If you don't have enough memory, the allocation will just fail.

> The one I mentioned in my posting has 1.8GB of RAM. If this isn't
> enough then we're in trouble. The grad student I mentioned
> has been using the 898MB version of this card without problems.
>
> > 4) I can tell you 100% that not all gpu are created equally when it
> > comes to handling cuda code. I don't have experience with the GTX275,
> > but if you do hit issues I would be curious to hear about them.
>
> I've heard that it's much better than the 9500GT that we first
> started using.
> Since the 9500GT is a much cheaper card we didn't expect
> much performance out of it, but the grad student who was trying
> to use it said that there were problems with it not releasing memory,
> resulting in having to reboot the host. I don't know the details.
>

I don't have any issues with releasing memory. The big differences are between the G80/G90 series (including the 9500GT), which is the 1.1 Cuda model, and the G200, which uses the 1.3 Cuda model. Memory handling is much better on the 1.3 GPUs (the memory access patterns needed to fully utilize the memory bandwidth are much more lenient). The G200 also has double-precision support (although at about 1/8 the speed of single precision). There is also more support for atomic operations and a few other differences, although the biggest difference is the memory bandwidth utilization.

Don't bother with the 8000 and 9000 series for HPC and Cuda. Cheaper for learning, but not so much for deployment.

> > Some questions in return..
> > Is your code currently C, C++ or Fortran?
>
> The most important program for this group is in Fortran.
> We're going to keep it in Fortran, but we're going to
> write C interfaces to the routines that will run on
> the GPU, and then write these routines in C.
>

You may want to look into the PGI compiler. They introduced Cuda support for Fortran, I believe since November.
http://www.pgroup.com/resources/cudafortran.htm

> > Is there any interest in optimizations at the compiler level which could
> > benefit molecular dynamics simulations?
>
> Of course, but at what price? I'm talking about
> both the price in dollars and the price in non-standard
> directives.
>
> I'm not a chemist, so I don't know what would speed up MD calculations
> more than a good GPU.
>

On the CPU side you can utilize SSE. You can also use single precision on the CPU, along with SSE and good cache utilization, to speed things up greatly on the CPU as well. My personal experience, though, is that it's much harder to use such optimization on the CPU than on the GPU for most problems.

> Cordially,
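A tiny illustration of the kind of SSE use meant above - single-precision, four-wide adds via compiler intrinsics. The function and array names are invented, and it assumes 16-byte-aligned arrays (e.g. from _mm_malloc) with a length divisible by 4:

    #include <xmmintrin.h>

    /* c = a + b over n floats, 4 at a time */
    void vadd_sse(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(a + i);   /* aligned 4-float load */
            __m128 vb = _mm_load_ps(b + i);
            _mm_store_ps(c + i, _mm_add_ps(va, vb));
        }
    }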
From michf at post.tau.ac.il Sun Jan 31 11:17:41 2010
From: michf at post.tau.ac.il (Micha Feigin)
Date: Sun, 31 Jan 2010 21:17:41 +0200
Subject: [Beowulf] GPU Beowulf Clusters
In-Reply-To: <4B65C8B0.9060300@pathscale.com>
References: <4B61CB86.9060105@berkeley.edu> <20100130143145.43de588a@vivalunalitshi.luna.local> <4B647949.9030700@berkeley.edu> <4B64B83B.9080303@pathscale.com> <4B64DD37.50201@berkeley.edu> <20100131193358.395acc13@vivalunalitshi.luna.local> <4B65C8B0.9060300@pathscale.com>
Message-ID: <20100131211741.13793738@vivalunalitshi.luna.local>

On Sun, 31 Jan 2010 21:15:12 +0300
"C. Bergström" wrote:

> Micha Feigin wrote:
> > On Sat, 30 Jan 2010 17:30:31 -0800
> > Jon Forrest wrote:
> > [snip]
> >
> > People are starting to work with OpenCL, but I don't think that it's ready yet.
> > The NVidia implementation is still buggy and not up to par against Cuda in
> > terms of performance.
>
> That used to be true, but I thought they fixed that? (How old is your
> information)

From Thursday... (three days or so). Not personal though; I prefer Cuda. I've got a friend who's working with Prof. Amnon Barak of the Hebrew University of Jerusalem, who created MOSIX, to do something similar for the GPU, and they are doing it with OpenCL. One example: you can pass NULL as the workgroup size, and the system should set an optimal workgroup size automatically. Turns out that NVidia sets it to 1. Anyone who knows NVidia knows how good that is ...

> > Cuda has existed for 5 years (and another 2 internally at NVidia). Version 1 of
> > OpenCL was released December 2008, and they started working on 1.1 immediately
> > after that. It has also been broken almost from the start due to too many
> > companies controlling it (it's designed by a consortium) and trying to solve the
> > problem for too many scenarios at the same time.
>
> The problem isn't too many companies.. It was IBM's cell requirements
> afaik.. Thank god that's dead now.. It's also intel vs. nvidia vs. amd
>
> > ATI also started supporting OpenCL, but I don't have any experience with that.
> > Their upside is that it also allows compiling CPU versions.
> >
> > I would start with Cuda, as the move to OpenCL is very simple afterwards if you
> > wish, and Cuda is easier to start with.
>
> I would start with a directive-based approach that's entirely more sane
> than CUDA or OpenCL.. Especially if his code is primarily Fortran. I
> think writing C interfaces so that you can call the GPU is a maintenance
> nightmare and will not only be time-consuming, but will later make
> optimizing the application *a lot* harder. (I say this with my gpu
> compiler hat on and more than happy to go into specifics)

My experience is that it will never be as good, but I'll be happy to hear, from personal experience, by how much. I'm guessing that you are talking about HMPP or something similar here. Just moving stuff to the GPU entails very little overhead and gives you much more control of the memory and communication handling. For stuff that needs shared memory and/or textures for good performance, you usually need the direct control anyway. No experience with HMPP though; I should probably test-run it at some point. Personally I'd love to hear about specifics.

> > Also note that OpenCL gives you functional portability but not performance
> > portability. You will not write the same OpenCL code for NVidia, ATI, CPUs, etc.
> > The vectorization should be all different (NVidia discourages vectorization, ATI
> > requires vectorization, SSE requires different vectorization), the memory model
> > is different, the size of the work groups should be different, etc.
>
> Please look at HMPP and see if it may solve this..

Will do

[... snip again ...]

> > The issue is not only computational complexity but also regular memory accesses.
> > Random memory accesses on the GPU can seriously kill your performance.
>
> I think I mentioned memory accesses.. Are you talking about page faults
> or what specifically? (My perspective is skewed and I may be using a
> different term.)

No, just random memory accesses; think lookup tables. LUTs are horrible performance-wise on the GPU. If you can't get coalescing working for you, you can get a factor-of-8 (IIRC) performance hit.
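To make the coalescing point concrete, a contrived kernel pair (names invented for illustration). On 1.x-era hardware, the first pattern lets a half-warp's loads merge into a single memory transaction; the LUT-driven gather can scatter them into one transaction per thread:

    __global__ void coalesced(const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i] * 2.0f;        /* neighboring threads read neighboring
                                         addresses: coalesced */
    }

    __global__ void scattered(const float *in, const int *lut, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[lut[i]] * 2.0f;   /* table-driven gather: worst case,
                                         one transaction per thread */
    }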
[... snip once more ...]

> > You don't have page faulting on the GPU; GPUs don't have virtual memory. If you
> > don't have enough memory, the allocation will just fail.
>
> Whatever you want to label it, at a hardware level nvidia cards *do* have
> vram and the drivers *can* swap to system memory. They use two things
> to deal with this: a) a hw-based page fault mechanism and b) dma copying to
> reduce cpu overhead. If you try to allocate more than is available on
> the card, yes, it will probably just fail. (We are working on the
> drivers) My point was about what happens between the context switches
> of kernels.

I'm not aware of intentional swapping done by NVidia. I've had issues with kernels dying due to lack of memory on my laptop which could have been solved had there been swapping. And I'm talking about memory allocated from different processes, where for each process it does fit in memory.

There is an issue with Windows Vista/7 vs. XP, where Windows Vista/7 decided to manage the GPU memory as virtual memory, but again, I'm not sure about actual swapping. I need to get updated on the exact details, as I didn't test-drive Win 7 too much.

I probably should ask one of the devtechs at NVidia what is done at the driver level. Pity I didn't see this thread last week, as there were a few of them around for a visit :(

[ ...]

> > My personal experience, though, is that it's much harder to use such optimization
> > on the CPU than on the GPU for most problems.
>
> CUDA/OpenCL and friends implicitly identify which areas can be
> vectorized and then explicitly offload them. You are comparing
> apples/oranges here..

Cuda/OpenCL do it explicitly, actually. You have things like auto-vectorization by the Intel compiler, but it's very limited in recognizing vectorizable code. For anything big you need to vectorize manually. If you look at the OpenCL tutorials from ATI, they tell you that you need to use float4 if you want the CPU code to vectorize using SSE; it's not done implicitly.

From gerry.creager at tamu.edu Sun Jan 31 12:31:40 2010
From: gerry.creager at tamu.edu (Gerry Creager)
Date: Sun, 31 Jan 2010 14:31:40 -0600
Subject: [Beowulf] GPU Beowulf Clusters
In-Reply-To: <20100131190648.6d91b2f7@vivalunalitshi.luna.local>
References: <4B61CB86.9060105@berkeley.edu> <20100130143145.43de588a@vivalunalitshi.luna.local> <4B647949.9030700@berkeley.edu> <20100131190648.6d91b2f7@vivalunalitshi.luna.local>
Message-ID: <4B65E8AC.1010904@tamu.edu>

I employ a GTX 285 in a dedicated remote-access graphics box for data-local visualization and run into some of these issues, too. More inline, but Micha has it right.

Micha Feigin wrote:
> On Sat, 30 Jan 2010 10:24:09 -0800
> Jon Forrest wrote:
>
>> On 1/30/2010 4:31 AM, Micha Feigin wrote:
>>
>>> It is recommended, BTW, that you have at least the same amount of system memory
>>> as GPU memory, so with Tesla it is 4GB per GPU.
>>
>> I'm not going to get Teslas, for several reasons:
>>
>> 1) This is a proof-of-concept cluster.
Since I'm not yet interested enough to actually look at the onboard chip speeds, I don't know. However, the one I've now got is in a 4u with additional forced air in the case to support an overtemp problem we had that was primarily flow-related (extra fans in the 2u and 3u cases we tried). We've not wandered too far into GPGPU processing... our user community has not shown an interest in it, but for graphics, it's useful. > The geforce driver also throttles the card under load to solve thermal issues. I believe this depends on onboard temp monitoring. Again, sufficient airflow is your friend. > You will probably want to under clock the cards to the tesla spec and be sure > to monitor the thermal state. > > I know someone who works with 3 gtx295 in a desktop box and he initially had > some thermal shutdown issues with older drivers. I'm guessing that the newer > drivers just throttle the cards more aggressively under load. > >> 2) We know that the Fermi cards are coming out >> soon. If we were going to spend big bucks >> on GPUs, we'd wait for them. But, our funding >> runs out before the Fermis will be available. >> This is too bad but there's nothing I can do >> about it. >> > > Check out the mad scientist program, it's supposed to end today, but maybe if you talk to NVidia they can still get you into it (they are rather flexible, esspecially with universities, and they also offer if for companies) > http://www.nvidia.com/object/mad_science_promo.html > You can buy a current telsa (t10 core) and upgrade it for a fermi (t20 core) > when it comes out for the cost difference. May be more cost effective if you do > plan to build a fermi cluster later on. It is designed to upgrade to the same > line though (c, m or s) so you may want to consider now which one to go with. > >> See below for comments regarding CPUs and cores. >> >>> You use dedicated systems. Either one 1u pizza box for the CPU and a matched 1u >>> tesla s1070 pizza box which has 4 tesla GPUs >> Since my first post I've learned about the Supermicro boxes >> that have space for two GPUs >> (http://www.supermicro.com/products/system/1U/6016/SYS-6016GT-TF.cfm?GPU=) . >> This looks like a good way to go for a proof-of-concept cluster. Plus, >> since we have to pay $10/U/month at the Data Center, it's a good >> way to use space. >> > > See my previous comment > >> The GPU that looks the most promising is the GeForce GTX275. >> (http://www.evga.com/products/moreInfo.asp?pn=017-P3-1175-AR) >> It has 1792MB of RAM and is only ~$300. I realize that there >> are better cards but for this proof-of-concept cluster we >> want to get the best bang for the buck. Later, after we've >> ported our programs, and have some experience optimizing them, >> then we'll consider something better, probably using whatever >> the best Fermi-based card is. >> >> The research group that will be purchasing this cluster does >> molecular dynamics simulations that usually take 24 hours or more >> to complete using quad-core Xeons. We hope to bring down this >> time substantially. >> >>> It doesn't have a swap in/swap out mechanism, so the way it may time share is >>> by alternating kernels as long as there is enough memory. Shouldn't be done for >>> HPC (same with CPU by the way due to numa/l2 cache and context switching >>> issues). >> Right. So this means 4 cores should be good enough for 2 GPUs. >> I wish somebody made a motherboard that would allow 6-core >> AMD Istanbuls, but they don't. 
Putting 2 4-cores CPUs on the >> motherboard might not be worth the cost. I'm not sure. >> >>> The processes will be sharing the pci bus though for communications so you may >>> prefer to setup the system as 1 job per machine or at least a round robin >>> scheduler. >> This is another reason not to go crazy with lots of cores. >> They'll be sitting idle most of the time, unless I also >> create queues for normal non-GPU jobs. >> >>> Take note that the s1070 is ~6k$ so you are talking at most two to three >>> machines here with your budget. >> Ha, ha!! ~$6K should get me two compute nodes, complete >> with graphics cards. gerry From hahn at mcmaster.ca Sun Jan 31 14:06:34 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Sun, 31 Jan 2010 17:06:34 -0500 (EST) Subject: [Beowulf] GPU Beowulf Clusters In-Reply-To: <4B65E8AC.1010904@tamu.edu> References: <4B61CB86.9060105@berkeley.edu> <20100130143145.43de588a@vivalunalitshi.luna.local> <4B647949.9030700@berkeley.edu> <20100131190648.6d91b2f7@vivalunalitshi.luna.local> <4B65E8AC.1010904@tamu.edu> Message-ID: >> Be very very sure that consumer geforces can go in 1u boxes. It's not so >> much >> the space as much as I'm skeptical with their ability of handling the >> thermal >> issues. They are just not designed for this kind of work. > > I've had to go to 2u and eventually to larger boxes because of power supply > and air-flow requirements. This is a big issue. I'm a bit puzzled here. sumermicro sells servers that take either two M1060's, or two C1060's, or two of any pcie 2 x16 cpus. their airflow design seems at least thought-about, and their PSU is 1400W. C1060 specs merely say "200W max, 160W typical" - which is probably about the same as gtx275 according to wikipedia. so something like 600W expected from 1U - not really that hard, especially if you don't have a wall of 40u racks full of them... >> Note that geforces are overclocked (my gtx 285 by 30% compared to a tesla >> with >> the same chip) well, they're tuned differently: gf cards have substantially higher memory clocks and lower shader clocks. tesla has higher shader and substantially slower memory clocks (presumably because there are more loads on the bus.) >> and are actively cooled, which means that you need to get >> air >> flowing into the side fan. That's exactly why they put the tesla m and not >> the >> c into those boxes. why is this a problem with 1U? or do you really mean "double-wide cards don't provide enough clearance in 1U to get air to the card's intake"? -mark hahn From michf at post.tau.ac.il Sun Jan 31 16:31:30 2010 From: michf at post.tau.ac.il (Micha) Date: Mon, 01 Feb 2010 02:31:30 +0200 Subject: [Beowulf] GPU Beowulf Clusters In-Reply-To: References: <4B61CB86.9060105@berkeley.edu> <20100130143145.43de588a@vivalunalitshi.luna.local> <4B647949.9030700@berkeley.edu> <20100131190648.6d91b2f7@vivalunalitshi.luna.local> <4B65E8AC.1010904@tamu.edu> Message-ID: <4B6620E2.10505@post.tau.ac.il> On 01/02/2010 00:06, Mark Hahn wrote: >>> Be very very sure that consumer geforces can go in 1u boxes. It's not >>> so much >>> the space as much as I'm skeptical with their ability of handling the >>> thermal >>> issues. They are just not designed for this kind of work. >> >> I've had to go to 2u and eventually to larger boxes because of power >> supply and air-flow requirements. This is a big issue. > > I'm a bit puzzled here. sumermicro sells servers that take either two > M1060's, or two C1060's, or two of any pcie 2 x16 cpus. 
their airflow > design seems at least thought-about, and their PSU is 1400W. > The PSU is enough > C1060 specs merely say "200W max, 160W typical" - which is probably > about the same as gtx275 according to wikipedia. so something like 600W > expected from 1U - not really that hard, especially if you don't > have a wall of 40u racks full of them... > >>> Note that geforces are overclocked (my gtx 285 by 30% compared to a >>> tesla with >>> the same chip) > > well, they're tuned differently: gf cards have substantially higher memory > clocks and lower shader clocks. tesla has higher shader and substantially > slower memory clocks (presumably because there are more loads on the bus.) > yes, they're tuned differently, but because they are meant for different markets. gf are for the gamer market and are assumed to be run for several hours at a time, without too many of them in the machine fighting for airflow (gamer setup). Throttling is no big issue if needed. tesla is a server product that needs to run 24/7 for days/months without throttling (consistent output). Usually there are several of them in one machine (or shared quadro + tesla) Another issue is tolerance to memory errors. Higher temp/clock can cause more memory errors. These may cause small unnoticeable glitches for game graphics but will ruin hpc results. The two main issues taken into account for tuning is running time, and leniency to throttling. >>> and are actively cooled, which means that you need to get air >>> flowing into the side fan. That's exactly why they put the tesla m >>> and not the >>> c into those boxes. > > why is this a problem with 1U? or do you really mean "double-wide cards > don't provide enough clearance in 1U to get air to the card's intake"? > all the cards we are talking about are double wide. c1060 is actively cooled and is designed for a desktop pc. m1060 is passively cooled and designed for a 1u server. the c1060 assumes side air intake and rear exhaust. m1060 expects through flow and no external access to exhaust. Different design based on different airflow paradigms. I never built some systems so I'm not talking from experience but assumption (we are using c1060 in desktops and s1070 in servers). I'm not sure if a double wide card with side air intake in a 1u box allow any airflow to reach the air intake and thus the GPU. Maybe you can mod the card by taking the plastic off to improve airflow though. It looks from their site that they support double wide cards in their boxes so I guess that they tested the cooling. They definitely have more experience than me with such setups. I didn't say that it doesn't work, I just advised that you make sure as it sounded borderline to me and as noted previously by someone else, it has caused problem for people. 
> -mark hahn > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From gerry.creager at tamu.edu Sun Jan 31 16:32:01 2010 From: gerry.creager at tamu.edu (Gerry Creager) Date: Sun, 31 Jan 2010 18:32:01 -0600 Subject: [Beowulf] GPU Beowulf Clusters In-Reply-To: References: <4B61CB86.9060105@berkeley.edu> <20100130143145.43de588a@vivalunalitshi.luna.local> <4B647949.9030700@berkeley.edu> <20100131190648.6d91b2f7@vivalunalitshi.luna.local> <4B65E8AC.1010904@tamu.edu> Message-ID: <4B662101.6080709@tamu.edu> Mark Hahn wrote: >>> Be very very sure that consumer geforces can go in 1u boxes. It's not >>> so much >>> the space as much as I'm skeptical of their ability to handle the >>> thermal >>> issues. They are just not designed for this kind of work. >> >> I've had to go to 2u and eventually to larger boxes because of power >> supply and air-flow requirements. This is a big issue. > > I'm a bit puzzled here. Supermicro sells servers that take either two > M1060's, or two C1060's, or two of any pcie 2 x16 gpus. their airflow > design seems at least thought-about, and their PSU is 1400W. > > C1060 specs merely say "200W max, 160W typical" - which is probably > about the same as gtx275 according to wikipedia. so something like 600W > expected from 1U - not really that hard, especially if you don't > have a wall of 40u racks full of them... A little over a year ago, a 1u 600w supply was a bit difficult to find for < $400, and additional fans for one required buying a specialty 1u case. I could have driven node price over $4k with CPUs, memory, a large onboard scratch, etc. I, too, was building a proof-of-concept box at the time. Now, it's used almost daily by several folks, and I'm thinking of building a new POC to house 4x gx's... And there's still no user interest in CUDA; without that interest I don't have enough time to play, given that my own research program isn't computational science. >>> Note that geforces are overclocked (my gtx 285 by 30% compared to a >>> tesla with >>> the same chip) > > well, they're tuned differently: gf cards have substantially higher memory > clocks and lower shader clocks. tesla has higher shader and substantially > slower memory clocks (presumably because there are more loads on the bus.) Didn't realize. Thanks. >>> and are actively cooled, which means that you need to get air >>> flowing into the side fan. That's exactly why they put the tesla m >>> and not the >>> c into those boxes. > > why is this a problem with 1U? or do you really mean "double-wide cards > don't provide enough clearance in 1U to get air to the card's intake"? That's what *I* found, anyway. Yeah, what you said. Sorry, I thought that was obvious. When you turn one of those cards on its side, you do have trouble with card-width clearance. gerry From skylar at cs.earlham.edu Sun Jan 31 17:17:27 2010 From: skylar at cs.earlham.edu (Skylar Thompson) Date: Sun, 31 Jan 2010 17:17:27 -0800 Subject: [Beowulf] Anyone with really large clusters seeing memory leaks with OFED 1.5 for tcp based apps? In-Reply-To: <4B651750.8080307@scalableinformatics.com> References: <4B651750.8080307@scalableinformatics.com> Message-ID: <4B662BA7.7020206@cs.earlham.edu> Joe Landman wrote: > Hi folks > > Trying to trace something annoying down, and see if we are running > into something that is known. > > OFED 1.5 on a 2.6.30.10 kernel.
Running a file system atop IPoIB > (many reasons, none I care to get into here at the moment). Under > light load, the file system gradually grabs memory. Possibly a leak, > not entirely sure. Could be the OFED stack underneath. Backing file > system is xfs. That has been (on this hardware in other > situations) rock solid stable. Here, xfs, OFED/IPoIB all toss their > cookies (and fail allocations) under moderate to heavy load. > > Working with the file system vendor on this. I am not sure we have > the answer nailed, so I wanted to see who out there is running a big ( > >512 nodes) cluster, doing large data transfers (preferably over > IPoIB), for data storage, and running a late model OFED. If you fall > into this category, please let me know, as I'd like to ask a few > questions offline about any observed OFED/IPoIB failure modes. I am > not convinced it is OFED/IPoIB, but I'd like to see what other people > have run into ... if anything. > > Thanks! We're running OFED 1.4 for our GPFS cluster, with RDMA used for data and IPoIB used for metadata and backups. We're looking at an upgrade to 1.5, so if you do find anything out I'd be very interested in knowing. -- -- Skylar Thompson (skylar at cs.earlham.edu) -- http://www.cs.earlham.edu/~skylar/ From cbergstrom at pathscale.com Sat Jan 30 14:52:43 2010 From: cbergstrom at pathscale.com ("C. Bergström") Date: Sun, 31 Jan 2010 01:52:43 +0300 Subject: [Beowulf] GPU Beowulf Clusters In-Reply-To: <4B647949.9030700@berkeley.edu> References: <4B61CB86.9060105@berkeley.edu> <20100130143145.43de588a@vivalunalitshi.luna.local> <4B647949.9030700@berkeley.edu> Message-ID: <4B64B83B.9080303@pathscale.com> Jon Forrest wrote: > On 1/30/2010 4:31 AM, Micha Feigin wrote: > >> It is recommended BTW, that you have at least the same amount of >> system memory >> as GPU memory, so with tesla it is 4GB per GPU. > > I'm not going to get Teslas, for several reasons: > > 1) This is a proof of concept cluster. Spending $1200 > per graphics card means that the GPUs alone, assuming > 2 GPUs, would cost as much as a whole node with > 2 consumer-grade cards. (See below) > > 2) We know that the Fermi cards are coming out > soon. If we were going to spend big bucks > on GPUs, we'd wait for them. But, our funding > runs out before the Fermis will be available. > This is too bad but there's nothing I can do > about it. > > See below for comments regarding CPUs and cores. > >> You use dedicated systems. Either one 1u pizza box for the CPU and a matched 1u >> tesla s1070 pizza box which has 4 tesla GPUs > > Since my first post I've learned about the Supermicro boxes > that have space for two GPUs > (http://www.supermicro.com/products/system/1U/6016/SYS-6016GT-TF.cfm?GPU=) > . > This looks like a good way to go for a proof-of-concept cluster. Plus, > since we have to pay $10/U/month at the Data Center, it's a good > way to use space. > > The GPU that looks the most promising is the GeForce GTX275. > (http://www.evga.com/products/moreInfo.asp?pn=017-P3-1175-AR) > It has 1792MB of RAM and is only ~$300. I realize that there > are better cards but for this proof-of-concept cluster we > want to get the best bang for the buck.
Later, after we've > ported our programs, and have some experience optimizing them, > then we'll consider something better, probably using whatever > the best Fermi-based card is. > > The research group that will be purchasing this cluster does > molecular dynamics simulations that usually take 24 hours or more > to complete using quad-core Xeons. We hope to bring down this > time substantially. > >> It doesn't have a swap in/swap out mechanism, so the way it may time >> share is >> by alternating kernels as long as there is enough memory. Shouldn't >> be done for >> HPC (same with CPU by the way due to numa/l2 cache and context switching >> issues). > > Right. So this means 4 cores should be good enough for 2 GPUs. > I wish somebody made a motherboard that would allow 6-core > AMD Istanbuls, but they don't. Putting 2 4-core CPUs on the > motherboard might not be worth the cost. I'm not sure. > >> The processes will be sharing the pci bus though for communications >> so you may >> prefer to set up the system as 1 job per machine or at least a round >> robin >> scheduler. > > This is another reason not to go crazy with lots of cores. > They'll be sitting idle most of the time, unless I also > create queues for normal non-GPU jobs. > >> Take note that the s1070 is ~6k$ so you are talking at most two to three >> machines here with your budget. > > Ha, ha!! ~$6K should get me two compute nodes, complete > with graphics cards. > > I appreciate everyone's comments, and I welcome more. Hi Jon, I must emphasize what David Mathog said about the importance of the gpu programming model. My perspective (with hopefully not too much opinion added): OpenCL vs CUDA - OpenCL is 1/10th as popular, lacking in features, more tedious to write, and in an effort to stay generic loses the potential to fully exploit the gpu. At one point the performance of the drivers from Nvidia was not equivalent, but I think that's been fixed. (This does not mean all vendors are uniformly doing a good job) As for HMPP and everything else, I'm far too biased to offer my comments publicly. (Feel free to email me offlist if curious) Have you considered sharing access with another research lab that has already purchased something similar? (Some vendors may also be willing to let you run your codes in exchange for feedback.) I'd not completely disregard the importance of the host processor. 1) sw thread synchronization chews up processor time 2) Do you already know if your code has enough computational complexity to outweigh the memory access costs? 3) Do you know if the GTX275 has enough vram? Your benchmarks will suffer if you start going to gart and page faulting 4) I can tell you 100% that not all gpus are created equal when it comes to handling cuda code. I don't have experience with the GTX275, but if you do hit issues I would be curious to hear about them. Some questions in return.. Is your code currently C, C++ or Fortran? Is there any interest in optimizations at the compiler level which could benefit molecular dynamics simulations?
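[Editor's note: to make point 3) above concrete, here is a minimal device-query sketch using the standard CUDA runtime API. The device index and output format are illustrative assumptions, but totalGlobalMem is what bounds how much you can cudaMalloc before allocations start failing.]

/* query_vram.cu -- build with: nvcc query_vram.cu -o query_vram */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;  /* filled in by the runtime */
    int dev = 0;          /* first GPU; enumerate with cudaGetDeviceCount() */

    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) {
        fprintf(stderr, "no usable CUDA device found\n");
        return 1;
    }
    printf("%s: %lu MB global memory, compute capability %d.%d\n",
           prop.name,
           (unsigned long)(prop.totalGlobalMem >> 20),
           prop.major, prop.minor);
    return 0;
}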
Best, ./Christopher From cbergstrom at pathscale.com Sun Jan 31 10:15:12 2010 From: cbergstrom at pathscale.com ("C. Bergström") Date: Sun, 31 Jan 2010 21:15:12 +0300 Subject: [Beowulf] GPU Beowulf Clusters In-Reply-To: <20100131193358.395acc13@vivalunalitshi.luna.local> References: <4B61CB86.9060105@berkeley.edu> <20100130143145.43de588a@vivalunalitshi.luna.local> <4B647949.9030700@berkeley.edu> <4B64B83B.9080303@pathscale.com> <4B64DD37.50201@berkeley.edu> <20100131193358.395acc13@vivalunalitshi.luna.local> Message-ID: <4B65C8B0.9060300@pathscale.com> Micha Feigin wrote: > On Sat, 30 Jan 2010 17:30:31 -0800 > Jon Forrest wrote: > > >> On 1/30/2010 2:52 PM, "C. Bergström" wrote: >> >> >>> Hi Jon, >>> >>> I must emphasize what David Mathog said about the importance of the gpu >>> programming model. >>> >> I don't doubt this at all. Fortunately, we have lots >> of very smart people here at UC Berkeley. I have >> the utmost confidence that they will figure this >> stuff out. My job is to purchase and configure the >> cluster. >> >> >>> My perspective (with hopefully not too much opinion added) >>> OpenCL vs CUDA - OpenCL is 1/10th as popular, lacking in features, more >>> tedious to write, and in an effort to stay generic loses the potential to >>> fully exploit the gpu. At one point the performance of the drivers from >>> Nvidia was not equivalent, but I think that's been fixed. (This does not >>> mean all vendors are uniformly doing a good job) >>> >> This is very interesting news. As far as I know, nobody is doing >> anything with OpenCL in the College of Chemistry around here. >> On the other hand, we've been following all the press about how >> it's going to be the great unifier so that it won't be necessary >> to use a proprietary API such as CUDA anymore. At this point it's too >> early to do anything with OpenCL until our colleagues in >> the Computer Science department have made a pass at it and >> have experiences to talk about. >> >> > > People are starting to work with OpenCL but I don't think that it's ready yet. > The nvidia implementation is still buggy and not up to par against cuda in > terms of performance. Code is longer and more tedious (mostly matches the > nvidia driver model instead of the much easier to use c api). I know that > although NVidia say that they fully support it, they don't like it too much. > NVidia techs told me that the performance difference can be about 1:2. > That used to be true, but I thought they fixed that? (How old is your information) > Cuda exists for 5 years (and another 2 internally in NVidia). Version 1 of > OpenCL was released December 2008 and they started working on 1.1 immediately > after that. It has also been broken almost from the start due to too many > companies controlling it (it's designed by a consortium) and trying to solve the > problem for too many scenarios at the same time. > The problem isn't too many companies.. It was IBM's cell requirements afaik.. Thank god that's dead now.. > ATI also started supporting OpenCL but I don't have any experience with that. > Their upside is that it also allows compiling cpu versions. > > I would start with cuda as the move to OpenCL is very simple afterwards if you > wish and Cuda is easier to start with. > I would start with a directive based approach that's entirely more sane than CUDA or OpenCL.. Especially if his code is primarily Fortran.
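[Editor's note: for concreteness, this is roughly what the "start with cuda" path looks like when the main program stays in Fortran. This is a hypothetical sketch, not code from the thread -- the kernel body and names are invented, and the trailing-underscore, pass-by-reference binding is only the default convention of most Fortran compilers of the era.]

/* scale_forces.cu -- hypothetical glue between a Fortran main program
   and a CUDA kernel. Fortran side: call scale_forces_gpu(n, a, f) */
#include <cuda_runtime.h>

/* stand-in for a real MD force kernel */
__global__ void scale_forces(int n, float a, float *f)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        f[i] *= a;
}

/* extern "C" suppresses C++ name mangling; the trailing underscore
   matches the usual Fortran external-name convention, and every
   argument arrives by reference */
extern "C" void scale_forces_gpu_(int *n, float *a, float *f)
{
    float *d_f;
    size_t bytes = (size_t)(*n) * sizeof(float);

    cudaMalloc((void **)&d_f, bytes);
    cudaMemcpy(d_f, f, bytes, cudaMemcpyHostToDevice);

    int block = 256;                       /* threads per block */
    int grid = (*n + block - 1) / block;   /* enough blocks to cover n */
    scale_forces<<<grid, block>>>(*n, *a, d_f);

    cudaMemcpy(f, d_f, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_f);
}

[Every offloaded routine needs its own wrapper and host/device copies like these, which is the maintenance cost being weighed just below.]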
I think writing C interfaces so that you can call the GPU is a maintenance nightmare and will not only be time consuming, but will later make optimizing the application *a lot* harder. (I say this with my gpu compiler hat on and am more than happy to go into specifics) > Also note that OpenCL gives you functional portability but not performance > portability. You will not write the same OpenCL code for NVidia, ATI, CPUs etc. > The vectorization should be all different (NVidia discourage vectorization, ATI > require vectorization, SSE requires different vectorization), the memory model > is different, the size of the work groups should be different, etc. > Please look at HMPP and see if it may solve this.. > >>> Have you considered sharing access with another research lab that has >>> already purchased something similar? >>> (Some vendors may also be willing to let you run your codes in exchange >>> for feedback.) >>> >> There's nobody else at UC Berkeley I know of who has a GPU >> cluster. >> >> I don't know of any vendor who'd be willing to volunteer >> their cluster. If anybody would like to volunteer, step >> right up. >> >> > > Are you aware of the NVidia professor partnership program? We got a Tesla S1070 > for free from them. > > http://www.nvidia.com/page/professor_partnership.html > > >>> 1) sw thread synchronization chews up processor time >>> >> Right, but let's say right now 80% of the CPU time is spent >> in routines that will eventually be done on the GPU (I'm >> just making this number up). I don't see how having a faster >> CPU would help overall. >> >> > > My experience is that unless you wish to write hybrid code (code that partly > runs on the GPU and partly on the CPU in parallel to fully utilize the system) > you don't need to care too much about the CPU power. > > Note that the Cuda model is asynchronous so you can run code in parallel > between the GPU and CPU. > > >>> 2) Do you already know if your code has enough computational complexity >>> to outweigh the memory access costs? >>> >> In general, yes. A couple of grad students have ported some >> of their code to CUDA with excellent results. Plus, molecular >> dynamics is well suited to GPU programming, or so I'm told. >> Several of the popular opensource MD packages have already >> been ported also with excellent results. >> >> > > The issue is not only computation complexity but also regular memory accesses. > Random memory accesses on the GPU can seriously kill your performance. > I think I mentioned memory accesses.. Are you talking about page faults or what specifically? (My perspective is skewed and I may be using a different term.) > Also note that until fermi comes out the double precision performance is > horrible. If you can't use single precision then GPUs are probably not for you > at the moment. Double precision on g200 is around 1/8 of single precision > and g80/g90 don't have double precision at all. > > Fermi improves that by finally providing double precision running at 1/2 the > single precision speed (basically combining two FPUs into one double precision > unit). > > >>> 3) Do you know if the GTX275 has enough vram? Your benchmarks will >>> suffer if you start going to gart and page faulting >>> > > You don't have page faulting on the GPU, GPUs don't have virtual memory. If you > don't have enough memory the allocation will just fail. > Whatever you want to label it at a hardware level nvidia cards *do* have vram and the drivers *can* swap to system memory.
They use two things to deal with this: a) a hw based page fault mechanism and b) dma copying to reduce cpu overhead. If you try to allocate more than is available on the card, yes, it will probably just fail. (We are working on the drivers) My point was about what happens between the context switches of kernels. > >> The one I mentioned in my posting has 1.8GB of RAM. If this isn't >> enough then we're in trouble. The grad student I mentioned >> has been using the 898MB version of this card without problems. >> >> >>> 4) I can tell you 100% that not all gpus are created equal when it >>> comes to handling cuda code. I don't have experience with the GTX275, >>> but if you do hit issues I would be curious to hear about them. >>> >> I've heard that it's much better than the 9500GT that we first >> started using. Since the 9500GT is a much cheaper card we didn't expect >> much performance out of it, but the grad student who was trying >> to use it said that there were problems with it not releasing memory, >> resulting in having to reboot the host. I don't know the details. >> >> > > I don't have any issues with releasing memory. The big differences are between > the g80/g90 series (including the 9500GT), which is a 1.1 Cuda model, and the > g200, which uses the 1.3 cuda model. > > Memory handling is much better on the 1.3 GPUs (the access patterns required to fully > utilize the memory bandwidth are much more lenient). The g200 also has double > precision support (although at about 1/8 the speed of single precision). There > is also more support for atomic operations and a few other differences, > although the biggest difference is the memory bandwidth utilization. > > Don't bother with the 8000 and 9000 for HPC and Cuda. Cheaper for learning but > not so much for deployment. > > >>> Some questions in return.. >>> Is your code currently C, C++ or Fortran? >>> >> The most important program for this group is in Fortran. >> We're going to keep it in Fortran, but we're going to >> write C interfaces to the routines that will run on >> the GPU, and then write these routines in C. >> >> > > You may want to look into the pgi compiler. They introduced Cuda support for > Fortran, I believe since November. > http://www.pgroup.com/resources/cudafortran.htm > Can anyone give positive feedback? (Disclaimer: I'm biased, but since we are making specific recommendations) http://www.caps-entreprise.com/fr/page/index.php?id=49&p_p=36 > >>> Is there any interest in optimizations at the compiler level which could >>> benefit molecular dynamics simulations? >>> >> Of course, but at what price? I'm talking about >> both the price in dollars, and the price in non-standard >> directives. >> >> I'm not a chemist so I don't know what would speed up MD calculations >> more than a good GPU. >> >> > > On the cpu side you can utilize SSE. You can also use single precision on the > CPU along with SSE and good cache utilization to greatly speed up things also > on the CPU. > > My personal experience though is that it's much harder to use such optimizations > on the CPU than on the GPU for most problems. > CUDA/OpenCL and friends implicitly identify which areas can be vectorized and then explicitly offload them. You are comparing apples/oranges here..
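[Editor's note: a sketch of the memory-access point argued above. These are hypothetical kernels, not code from the thread: consecutive threads reading consecutive addresses coalesce into a few wide memory transactions, while indirected reads (e.g. through an MD neighbour list) do not. The penalty is severe on compute 1.0/1.1 parts and milder on 1.3 (g200) hardware, as Micha describes.]

/* coalesced: thread i reads element i, so a half-warp's loads
   merge into a small number of wide memory transactions */
__global__ void copy_coalesced(int n, const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

/* scattered: the indirection through idx[] (think: a neighbour list)
   defeats coalescing, so each load may become its own transaction */
__global__ void gather_scattered(int n, const int *idx,
                                 const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[idx[i]];
}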
From plegresl at gmail.com Sun Jan 31 11:45:49 2010 From: plegresl at gmail.com (Patrick LeGresley) Date: Sun, 31 Jan 2010 11:45:49 -0800 Subject: [Beowulf] Re: GPU Beowulf Clusters In-Reply-To: <201001311920.o0VJKUTg027827@bluewest.scyld.com> References: <201001311920.o0VJKUTg027827@bluewest.scyld.com> Message-ID: <7A9A3281-CCBD-413A-A0CC-102A0BD6388B@gmail.com> I've found this presentation from John Stone at SC09 to be a very good comparison of CUDA versus OpenCL performance on real code: > http://www.ks.uiuc.edu/Research/gpu/files/openclbof_stone2009.pdf My takeaway from this presentation, which matches my personal experience comparing the two, is that CUDA and OpenCL performance on NVIDIA hardware are within a few percent. Trying to use the same source code on hardware from different vendors obviously has the expected performance pitfalls. The biggest thing to watch out for may be performance regressions from one release of CUDA to the next, and even among slightly different driver versions. You can see an example of this from John on slide 17. Cheers, Patrick From chenyon1 at iit.edu Thu Jan 28 08:57:39 2010 From: chenyon1 at iit.edu (Yong Chen) Date: Thu, 28 Jan 2010 10:57:39 -0600 Subject: [Beowulf] [hpc-announce] Call For Papers: Intl. Workshop on Parallel Programming Models and Systems Software for HEC (P2S2) Message-ID: CALL FOR PAPERS =============== Third International Workshop on Parallel Programming Models and Systems Software for High-end Computing (P2S2) Sept. 13th, 2010 To be held in conjunction with ICPP-2010: The 39th International Conference on Parallel Processing, Sept. 13-16, 2010, San Diego, CA, USA Website: http://www.mcs.anl.gov/events/workshops/p2s2 SCOPE ----- The goal of this workshop is to bring together researchers and practitioners in parallel programming models and systems software for high-end computing systems. Please join us in a discussion of new ideas, experiences, and the latest trends in these areas at the workshop. TOPICS OF INTEREST ------------------ The focus areas for this workshop include, but are not limited to: * Systems software for high-end scientific and enterprise computing architectures o Communication subsystems for high-end computing o High-performance file and storage systems o Fault-tolerance techniques and implementations o Efficient and high-performance virtualization and other management mechanisms for high-end computing * Programming models and their high-performance implementations o MPI, Sockets, OpenMP, Global Arrays, X10, UPC, Chapel, Fortress and others o Hybrid Programming Models * Tools for Management, Maintenance, Coordination and Synchronization o Software for Enterprise Data-centers using Modern Architectures o Job scheduling libraries o Management libraries for large-scale systems o Toolkits for process and task coordination on modern platforms * Performance evaluation, analysis and modeling of emerging computing platforms PROCEEDINGS ----------- Proceedings of this workshop will be published in CD format and will be available at the conference (together with the ICPP conference proceedings). SUBMISSION INSTRUCTIONS ----------------------- Submissions should be in PDF format on U.S. Letter size paper. They should not exceed 8 pages (all inclusive). Submissions will be judged based on relevance, significance, originality, correctness and clarity. Please visit the workshop website at: http://www.mcs.anl.gov/events/workshops/p2s2/ for the submission link.
JOURNAL SPECIAL ISSUE --------------------- The best papers of P2S2'10 will be included in a special issue of the International Journal of High Performance Computing Applications (IJHPCA) on Programming Models, Software and Tools for High-End Computing. IMPORTANT DATES --------------- Paper Submission: March 3rd, 2010 Author Notification: May 3rd, 2010 Camera Ready: June 14th, 2010 PROGRAM CHAIRS -------------- * Pavan Balaji, Argonne National Laboratory * Abhinav Vishnu, Pacific Northwest National Laboratory PUBLICITY CHAIR --------------- * Yong Chen, Illinois Institute of Technology STEERING COMMITTEE ------------------ * William D. Gropp, University of Illinois Urbana-Champaign * Dhabaleswar K. Panda, Ohio State University * Vijay Saraswat, IBM Research PROGRAM COMMITTEE ----------------- * Ahmad Afsahi, Queen's University * George Almasi, IBM Research * Taisuke Boku, Tsukuba University * Ron Brightwell, Sandia National Laboratory * Franck Cappello, INRIA, France * Yong Chen, Illinois Institute of Technology * Ada Gavrilovska, Georgia Tech * Torsten Hoefler, Indiana University * Zhiyi Huang, University of Otago, New Zealand * Hyun-Wook Jin, Konkuk University, Korea * Zhiling Lan, Illinois Institute of Technology * Doug Lea, State University of New York at Oswego * Jiuxing Liu, IBM Research * Guillaume Mercier, INRIA, France * Scott Pakin, Los Alamos National Laboratory * Fabrizio Petrini, IBM Research * Bronis de Supinski, Lawrence Livermore National Laboratory * Sayantan Sur, IBM Research * Rajeev Thakur, Argonne National Laboratory * Vinod Tipparaju, Oak Ridge National Laboratory * Jesper Traff, NEC, Europe * Weikuan Yu, Auburn University If you have any questions, please contact us at p2s2-chairs at mcs.anl.gov ======================================================================== If you do not want to receive any more announcements regarding the P2S2 workshop, please unsubscribe here: https://lists.mcs.anl.gov/mailman/listinfo/hpc-announce ========================================================================