From csamuel at vpac.org Mon Jun 1 04:08:18 2009 From: csamuel at vpac.org (Chris Samuel) Date: Mon, 1 Jun 2009 21:08:18 +1000 (EST) Subject: [Beowulf] syslog to sql In-Reply-To: <1901329362.6852681243854434739.JavaMail.root@mail.vpac.org> Message-ID: <530657448.6852701243854498744.JavaMail.root@mail.vpac.org> ----- "Michael Di Domenico" wrote: > Does anyone have any opinions on which (free) programs work best for > sucking syslog data into mysql? I had used syslog-ng in the past, > but it looks like they went commercial on the sql import side. How about rsyslog? http://www.rsyslog.com/ It's in RHEL/CentOS 5. -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From rgb at phy.duke.edu Mon Jun 1 16:13:03 2009 From: rgb at phy.duke.edu (Robert G. Brown) Date: Mon, 1 Jun 2009 19:13:03 -0400 (EDT) Subject: [Beowulf] Mailing list statistics In-Reply-To: References: <20090523233245.GA19791@artificial-flavours.csclub.uwaterloo.ca> Message-ID: On Wed, 27 May 2009, Ricardo Reis wrote: > > I wonder the percentage the R.A.B. (rgb army of bots) hold on their own... > > :) Recently, not so much. Too busy. But I do listen and sometimes chime in...;-) rgb > > Ricardo Reis > > 'Non Serviam' > > PhD candidate @ Lasef > Computational Fluid Dynamics, High Performance Computing, Turbulence > http://www.lasef.ist.utl.pt > > & > > Cultural Instigator @ Rádio Zero > http://www.radiozero.pt > > http://www.flickr.com/photos/rreis/ Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From hearnsj at googlemail.com Tue Jun 2 01:14:36 2009 From: hearnsj at googlemail.com (John Hearns) Date: Tue, 2 Jun 2009 09:14:36 +0100 Subject: [Beowulf] Problems with nvidia driver In-Reply-To: References: Message-ID: <9f8092cc0906020114x1ddea360k79b8dd0fe940c42d@mail.gmail.com> 2009/5/26 Francesco Pietra : > Hi > Is 3D feasible with an NV11DDR (GeForce2 MX200 rev b2) chip and 32bit > linux (debian i386)? I met failure by doing You would be much better off asking for support from Nvidia directly, or on the Nvidia forums www.nvnews.net I do not use Debian, so cannot help you directly. However, I have recently been configuring Nvidia cards quite a lot, with the latest drivers. It always takes some time to get it right. Recently I had the FX 5800 card on test (nice!). My advice is that if you have a very, very recent card then load the beta drivers from the Nvidia site, and run the script to build the modules. If you have a less recent card, then SuSE Linux has a superb one-click install. Just click on the install links on this page: http://en.opensuse.org/NVIDIA and the Nvidia repositories are added to your system. Top marks to SuSE for that one. From mdidomenico4 at gmail.com Tue Jun 2 09:34:26 2009 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Tue, 2 Jun 2009 12:34:26 -0400 Subject: [Beowulf] dedupe filesystem Message-ID: Does anyone have an opinion on dedup'ing files on a filesystem, but not in the context of backups? I did a Google search for a program, but only seemed to find the big players in the context of backups and block levels. I just need a file-level check and report. Is scanning the filesystem and md5'ing the files really the best (or only) way to do this?
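A minimal sketch of the file-level check being asked about, assuming GNU find and coreutils are available; the scan root and temp file below are placeholders, not a recommendation of a particular tool. Files are grouped by size first, so only files whose sizes collide get checksummed, and a byte-for-byte cmp would still be needed to be completely certain (the fdupes tool discussed later in this thread does essentially this):

#!/bin/bash
# Sketch only: report files that appear to be duplicates.
# 1. record "size<TAB>path" for every regular file
# 2. md5sum only files whose size occurs more than once
# 3. group identical checksums (md5 hex digest = first 32 chars of each line)
# Filenames containing tabs or newlines will confuse this sketch.
SCAN_DIR="${1:-.}"
LIST=$(mktemp)

find "$SCAN_DIR" -type f -printf '%s\t%p\n' > "$LIST"

awk -F'\t' 'NR==FNR {count[$1]++; next} count[$1] > 1 {print $2}' "$LIST" "$LIST" \
  | tr '\n' '\0' \
  | xargs -0 md5sum \
  | sort \
  | uniq --check-chars=32 --all-repeated=separate

rm -f "$LIST"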
From hearnsj at googlemail.com Tue Jun 2 10:18:09 2009 From: hearnsj at googlemail.com (John Hearns) Date: Tue, 2 Jun 2009 18:18:09 +0100 Subject: [Beowulf] dedupe filesystem In-Reply-To: References: Message-ID: <9f8092cc0906021018n523a7081r744e0847691300fe@mail.gmail.com> I'm quite interested in this discussion. De-duplication seems to be all the rage in 'corporate IT'. Surely the purpose is mainly to deal with storing many copies of the same emails (or Excel files... etc.) In HPC one would hope that all files are different, so de-duplication would not be a feature which you need. Of course, 'hoping' is not good enough - and I'm interested in what tools you find Michael. From ashley at pittman.co.uk Tue Jun 2 10:39:40 2009 From: ashley at pittman.co.uk (Ashley Pittman) Date: Tue, 02 Jun 2009 18:39:40 +0100 Subject: [Beowulf] dedupe filesystem In-Reply-To: References: Message-ID: <1243964380.30944.8.camel@localhost.localdomain> On Tue, 2009-06-02 at 12:34 -0400, Michael Di Domenico wrote: > does anyone have an opinion on dedup'ing files on a filesystem, but > not in the context of backups? I did a google search for a program, > but only seemed to find the big players in the context of backups and > block levels. i just need a file level check and report. I'm not sure I understand the question, if it's a case of looking for duplicate files on a filesystem I use fdupes http://premium.caribe.net/~adrian2/fdupes.html > Is scanning the filesystem and md5'ing the files really the best (or > only) way to do this? Fdupes scans the filesystem looking for files where the size matches, if it does it md5's them checking for matches and if that matches it finally does a byte-by-byte compare to be 100% sure. As a result it can take a while on filesystems with lots of duplicate files. There is another test it could do after checking the sizes and before the full md5, it could compare the first say Kb which should mean it would run quicker in cases where there are lots of files which match in size but not content but anyway I digress. Ashley Pittman. From niftyompi at niftyegg.com Tue Jun 2 11:51:57 2009 From: niftyompi at niftyegg.com (Nifty Tom Mitchell) Date: Tue, 2 Jun 2009 11:51:57 -0700 Subject: [Beowulf] dedupe filesystem In-Reply-To: <1243964380.30944.8.camel@localhost.localdomain> References: <1243964380.30944.8.camel@localhost.localdomain> Message-ID: <20090602185157.GA2998@tosh2egg.wr.niftyegg.com> On Tue, Jun 02, 2009 at 06:39:40PM +0100, Ashley Pittman wrote: > On Tue, 2009-06-02 at 12:34 -0400, Michael Di Domenico wrote: > > does anyone have an opinion on dedup'ing files on a filesystem, but > > not in the context of backups? I did a google search for a program, > > but only seemed to find the big players in the context of backups and > > block levels. i just need a file level check and report. > > I'm not sure I understand the question, if it's a case of looking for > duplicate files on a filesystem I use fdupes > > http://premium.caribe.net/~adrian2/fdupes.html > > > Is scanning the filesystem and md5'ing the files really the best (or > > only) way to do this? > > Fdupes scans the filesystem looking for files where the size matches, if > it does it md5's them checking for matches and if that matches it > finally does a byte-by-byte compare to be 100% sure. As a result it can > take a while on filesystems with lots of duplicate files. 
> > There is another test it could do after checking the sizes and before > the full md5, it could compare the first say Kb which should mean it > would run quicker in cases where there are lots of files which match in > size but not content but anyway I digress. > > Ashley Pittman. > Not really a digression.... this is a performance oriented list. Below is my back pocket solution for finding things like multiple copies of big .iso files. As you indicate it could be dog slow. The very hard part is knowing what to do once a duplicate has been found so I look at all the duplicates with less. Another difficult part might be meta characters in file names, thus the print0.

#! /bin/bash
# find-duplicate -- released GPL
SIZER=' -size +10240k'
#SIZER=""
DIRLIST=". "

find $DIRLIST -type f $SIZER -print0 | xargs -0 md5sum |\
    egrep -v "d41d8cd98f00b204e9800998ecf8427e|LemonGrassWigs" |\
    sort > /tmp/looking4duplicates

cat /tmp/looking4duplicates | uniq --check-chars=32 --all-repeated=prepend | less

-- T o m M i t c h e l l Found me a new hat, now what? From matt at technoronin.com Tue Jun 2 21:56:54 2009 From: matt at technoronin.com (Matt Lawrence) Date: Tue, 2 Jun 2009 23:56:54 -0500 (CDT) Subject: [Beowulf] dedupe filesystem In-Reply-To: <9f8092cc0906021018n523a7081r744e0847691300fe@mail.gmail.com> References: <9f8092cc0906021018n523a7081r744e0847691300fe@mail.gmail.com> Message-ID: On Tue, 2 Jun 2009, John Hearns wrote: > I'm quite interested in this discussion. > De-duplication seems to be all the rage in 'corporate IT'. > Surely the purpose is mainly to deal with storing many copies of the > same emails (or Excel files... etc.) I have found a great deal of duplication in install trees, particularly when you just want to install the latest. I've managed to get some massive savings with NIM on AIX and some lesser but still very good savings with CentOS by building parallel trees and hard linking the files. -- Matt It's not what I know that counts. It's what I can remember in time to use. From Bogdan.Costescu at iwr.uni-heidelberg.de Wed Jun 3 02:54:38 2009 From: Bogdan.Costescu at iwr.uni-heidelberg.de (Bogdan Costescu) Date: Wed, 3 Jun 2009 11:54:38 +0200 (CEST) Subject: [Beowulf] dedupe filesystem In-Reply-To: <9f8092cc0906021018n523a7081r744e0847691300fe@mail.gmail.com> References: <9f8092cc0906021018n523a7081r744e0847691300fe@mail.gmail.com> Message-ID: On Tue, 2 Jun 2009, John Hearns wrote: > In HPC one would hope that all files are different, so > de-duplication would not be a feature which you need. I beg to differ, at least in the academic environment where I come from. Imagine these 2 scenarios: 1. copying between users step1: PhD student does a good job, produces data and writes thesis, then leaves the group but keeps data around because the final paper is still not written; in his/her new position, there's no free time so the final paper advances very slowly; however the data can't be moved to a slow medium because it's still actively worked on step2: another PhD student takes over the project and does something else that needs the data, so a copy is created(*).
step3: a short-term practical work by an undergraduate student who collaborates with the step2 PhD student needs access to the data; as the undergrad student is not trusted (he/she can make mistakes that delete/modify the data), another copy is created (*) copies are created for various reasons: privacy or intellectual property - people protect their data using Unix file access rights or ACLs, the copying is done with their explicit consent, either by them or by the sysadmin. fear of change - people writing up (or hoping to) don't want their data to change, so that they can f.e. go back and redo the graph that the reviewer asked for. They are particularly paranoid about their data and would prefer copying to allowing other people to access it directly. laziness - there can be technical solutions for the above 2 reasons, but if the people involved don't want to make the effort to use them, copying seems like a much easier solution. 2. copying between machines Data is stored on a group file server or on the cluster where it was created, but needs to be copied somewhere else for a more efficient (mostly from I/O point of view) analysis. A copy is made, but later on people don't remember why the copy was made and if there was any kind of modification to the data. Sometimes the results of the analysis (which can be very small compared with the actual data) are stored there as well, making the whole set look like a "package" worthy of being stored together, independent of the original data. This "package" can be copied back (so the two copies live in the same file system) or can remain separate (which can make it harder to detect as copies). I do mean all these in an HPC environment - the analysis mentioned before can involve reading multiple times files ranging from tens of GB to TB (for the moment...). Even if the analysis itself doesn't run as a parallel job, several (many) such jobs can run at the same time looking for different parameters. [ the above scenarios actually come from practice - not imagination - and are written with molecular dynamics simulations in mind ] Also don't forget backup - an HPC resource is usually backed up, to avoid loss of data which was obtained with precious CPU time (and maybe an expensive interconnect, memory, etc). -- Bogdan Costescu IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany Phone: +49 6221 54 8240, Fax: +49 6221 54 8850 E-mail: bogdan.costescu at iwr.uni-heidelberg.de From hearnsj at googlemail.com Wed Jun 3 03:10:05 2009 From: hearnsj at googlemail.com (John Hearns) Date: Wed, 3 Jun 2009 11:10:05 +0100 Subject: [Beowulf] dedupe filesystem In-Reply-To: References: <9f8092cc0906021018n523a7081r744e0847691300fe@mail.gmail.com> Message-ID: <9f8092cc0906030310j67d09343y3eab84a06c367a65@mail.gmail.com> 2009/6/3 Bogdan Costescu : > > I beg to differ, at least in the academic environment where I come from. > Imagine these 2 scenarios: > > 1. copying between users > You make good points, and I agree. > > 2. copying between machines > > Data is stored on a group file server or on the cluster where it > was created, but needs to be copied somewhere else for a more > efficient (mostly from I/O point of view) analysis. A copy is made, > but later on people don't remember why the copy was made and if > there was any kind of modification to the data. Sometimes the > results of the analysis (which can be very small compared with the > actual data) are stored there as well, making the whole set look >
like a "package" worthy of being stored together, independent of > ? the original data. This "package" can be copied back (so the two > ? copies live in the same file system) or can remain separate (which > ? can make it harder to detect as copies). This is something we should explore on this list. Quite often the architecture of storage is a secondary consideration, in the rush to get a Shiny New Fast machine on site and working. In HPC, there are a lot of advantages in a central clustered filesystem, where you can prepare your input data, run the simulation, then at the end visualize the data. I do agree with you that there are situations where you transfer the data to faster storage before running on it - I am thinking on one particular case right now! I Also agree with you that you then have the danger of 'squirreling away' copies on the fast storage, and forgetting why they are there. The systems administrator must put in place strong policies on this - leave your data on the fast storage, it gets deleted after N weeks. > > I do mean all these in a HPC environment - the analysis mentioned before can > involve reading multiple times files ranging from tens of GB to TB (for the > moment...). Even if the analysis itself doesn't run as a parallel job, > several (many) such jobs can run at the same time looking for different > parameters. [ the above scenarios actually come from practice - not > imagination - and are written with molecular dynamics simulations in mind ] > > Also don't forget backup - a HPC resource is usually backed up, to avoid > loss of data which was obtained with precious CPU time (and maybe an > expensive interconnect, memory, etc). > > -- > Bogdan Costescu > > IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany > Phone: +49 6221 54 8240, Fax: +49 6221 54 8850 > E-mail: bogdan.costescu at iwr.uni-heidelberg.de > From stewart at serissa.com Wed Jun 3 05:24:19 2009 From: stewart at serissa.com (Lawrence Stewart) Date: Wed, 3 Jun 2009 08:24:19 -0400 Subject: [Beowulf] Re: dedupe Filesystem Message-ID: <20D2BC0D-6686-4AE6-BBFA-28E49AC13686@serissa.com> I know a little bit about this from a time before SiCortex. The big push for deduplication came from disk-to-disk backup companies. As you can imagine, there is a huge advantage for deduplication if the problem you are trying to solve is backing up a thousand desktops. For that purpose, whole file duplicate detection works great. The next big problem is handling incremental backups. Making them run fast is important. And some applications, um, Outlook, have huge files (PST files) that change in minor ways every time you touch them. The big win here is the ability to detect and handle duplication at the block or sub-block level. This can have enormous performance advantages for incremental backups of those 1000 huge PST files. The technology for detecting sub-block level duplication is called "Shingling" or "rolling hashes" and was invented by Rivest (big surprise I guess!) and perhaps Mark Manasse. It is wicked clever stuff. The same schemes are used now for finding plagiarism among pages on the internet. I probably don't need to remind anyone here that deduplication on a live filesystem (as opposed to backups) can have really bad performance effects. Imagine if you have to move the disk arms around for every file for every block of every file. Modern filesystems do well at keeping files contiguous and often keep all the files of a directory nearly. This locality gets trashed by deduplicaiton. 
This won't matter if the problem is making backups smaller or making incrementals run faster, but it is not good for the performance of a live filesystem. -Larry/thinking about what to do next From mdidomenico4 at gmail.com Wed Jun 3 05:42:20 2009 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Wed, 3 Jun 2009 08:42:20 -0400 Subject: [Beowulf] dedupe filesystem In-Reply-To: <1243964380.30944.8.camel@localhost.localdomain> References: <1243964380.30944.8.camel@localhost.localdomain> Message-ID: On Tue, Jun 2, 2009 at 1:39 PM, Ashley Pittman wrote: > I'm not sure I understand the question, if it's a case of looking for > duplicate files on a filesystem I use fdupes > > http://premium.caribe.net/~adrian2/fdupes.html Fdupes is indeed the type of app i was looking for. I did run into one catch with it though, on first run it trounced down into a NetApp snapshot directory. Dupes galore... It would be nice if it kept a log too, so that if the files are the same on a second go around it didn't have to md5 every file all over again. From mdidomenico4 at gmail.com Wed Jun 3 05:55:52 2009 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Wed, 3 Jun 2009 08:55:52 -0400 Subject: [Beowulf] dedupe filesystem In-Reply-To: <9f8092cc0906030310j67d09343y3eab84a06c367a65@mail.gmail.com> References: <9f8092cc0906021018n523a7081r744e0847691300fe@mail.gmail.com> <9f8092cc0906030310j67d09343y3eab84a06c367a65@mail.gmail.com> Message-ID: On Wed, Jun 3, 2009 at 6:10 AM, John Hearns wrote: > I do agree with you that there are situations where you transfer the > data to faster storage before running on it - > I am thinking on one particular case right now! > I Also agree with you that you then have the danger of 'squirreling > away' copies on the fast storage, and forgetting why they are there. > The systems administrator must put in place strong policies on this - > leave your data on the fast storage, it gets deleted after N weeks. Do you find such a policy hard to enforce with researchers? I don't have tiered storage today, but in the future i can see a need to have a storage pool with SATA and a storage pool with SAS or faster drives in it. Some of the researchers where I am, work on data for months. Is this something better solved with pre/post-amble copies or through policies? From landman at scalableinformatics.com Wed Jun 3 06:18:19 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 03 Jun 2009 09:18:19 -0400 Subject: [Beowulf] dedupe filesystem In-Reply-To: References: <1243964380.30944.8.camel@localhost.localdomain> Message-ID: <4A26781B.1020907@scalableinformatics.com> It might be worth noting that dedup is not intended for high performance file systems ... the cost of computing the hash(es) is(are) huge. Dedup is used *primarily* to prevent filling up expensive file systems of limited size (e.g. SAN units with "fast" disks). For this crowd, 20-30TB is a huge system, and very expensive. Dedup (in theory) makes these file systems have a greater storage density, and also allows for faster DR, faster backup, and whatnot else ... assuming that Dedup is meaningful for the files stored. Its fine for slower directories, but the costs to Dedup usually involve a hardware or software layer which isn't cheap. Arguably, Dedup is more of a tactical effort on the part of the big storage vendors to reduce the outflow of their customers to less expensive storage modalities and products. It works well in some specific cases (with lots of replication), and poorly in many others. 
Think of trying to zip up a binary file with very little in the way of repeating patterns. Dedup is roughly akin to RLE encoding, with a shared database of blocks, using hash keys to represent specific blocks. If your data has lots of these identical blocks, then Dedup can save you lots of space. Point that block to the dictionary with the hash key, and when you read that block, pull it from the dictionary. This is how many of the backup folks get their claimed 99% compression BTW. They don't get this in general for a random collection of different files. They would get it for files that Dedup software can compress. Another technique is storing the original and diffs, or the current version and backward diffs. So if you have a block that differs in two characters, point to the original and a diff. The problem with this (TANSTAAFL) is that your dictionary (hash->block lookup) becomes your bottleneck (probably want a real database for this), and that this can fail spectacularly in the face of a) high data rates, b) minimal file similarity, c) many small operations on files. If you have Dedup anywhere, in your backup would be good. Just my $0.02. Joe Michael Di Domenico wrote: > On Tue, Jun 2, 2009 at 1:39 PM, Ashley Pittman wrote: >> I'm not sure I understand the question, if it's a case of looking for >> duplicate files on a filesystem I use fdupes >> >> http://premium.caribe.net/~adrian2/fdupes.html > > Fdupes is indeed the type of app i was looking for. I did run into > one catch with it though, on first run it trounced down into a NetApp > snapshot directory. Dupes galore... > > It would be nice if it kept a log too, so that if the files are the > same on a second go around it didn't have to md5 every file all over > again. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From deadline at eadline.org Wed Jun 3 06:40:08 2009 From: deadline at eadline.org (Douglas Eadline) Date: Wed, 3 Jun 2009 09:40:08 -0400 (EDT) Subject: [Beowulf] Webinar: Optimizing the Nehalem for HPC Message-ID: <46332.192.168.1.213.1244036408.squirrel@mail.eadline.org> I am moderating a webinar (1PM Eastern) tomorrow called: Optimizing the Nehalem for HPC We have two speakers that have hands-on experience with testing Nehalem and HPC. They will present some benchmarks and tips for HPC. Here is the link to sign up: http://www.linux-mag.com/id/7328 -- Doug From Bogdan.Costescu at iwr.uni-heidelberg.de Wed Jun 3 08:37:01 2009 From: Bogdan.Costescu at iwr.uni-heidelberg.de (Bogdan Costescu) Date: Wed, 3 Jun 2009 17:37:01 +0200 (CEST) Subject: [Beowulf] dedupe filesystem In-Reply-To: <4A26781B.1020907@scalableinformatics.com> References: <1243964380.30944.8.camel@localhost.localdomain> <4A26781B.1020907@scalableinformatics.com> Message-ID: On Wed, 3 Jun 2009, Joe Landman wrote: > It might be worth noting that dedup is not intended for high > performance file systems ... the cost of computing the hash(es) > is(are) huge.
Some file systems do (or claim to do) checksumming for data integrity purposes, this seems to me like the perfect place to add the computation of a hash - with data in cache (needed for checksumming anyay), the computation should be fast. This would allow runtime detection of duplicates, but would make detection of duplicates between file systems or for backup more cumbersome as the hashes would need to be exported somehow from the file system. One issue that was not mentioned yet is the strength/length of the hash - within one file system, it's known what are the limitations of number of blocks, files, file sizes, etc. and the hash can be chosen such that there are no collisions. By taking an arbitrarily large number of blocks/files as can be available on a machine or network with many large devices or file systems, the same guarantee doesn't hold anymore. -- Bogdan Costescu IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany Phone: +49 6221 54 8240, Fax: +49 6221 54 8850 E-mail: bogdan.costescu at iwr.uni-heidelberg.de From Bogdan.Costescu at iwr.uni-heidelberg.de Wed Jun 3 20:33:32 2009 From: Bogdan.Costescu at iwr.uni-heidelberg.de (Bogdan Costescu) Date: Thu, 4 Jun 2009 05:33:32 +0200 (CEST) Subject: [Beowulf] dedupe filesystem In-Reply-To: <9f8092cc0906030310j67d09343y3eab84a06c367a65@mail.gmail.com> References: <9f8092cc0906021018n523a7081r744e0847691300fe@mail.gmail.com> <9f8092cc0906030310j67d09343y3eab84a06c367a65@mail.gmail.com> Message-ID: On Wed, 3 Jun 2009, John Hearns wrote: > Quite often the architecture of storage is a secondary consideration, > in the rush to get a Shiny New Fast machine on site and working. Well, I've seen it ignored even outside of that rush - in the design phase. And I confess of being guilty of doing this as well, but I learn from mistakes :-) > In HPC, there are a lot of advantages in a central clustered > filesystem, where you can prepare your input data, run the > simulation, then at the end visualize the data. In theory, this sounds nice, but in practice it can prove to be a bit more difficult, most times the human factor being the main culprit (just like with the scenarios I presented earlier). Administrative issues (who owns what) can seriously affect the possibility of coupling the HPC and visualisation resources, leading often to duplication of data. Stupid sysadmins or policies can leave the HPC resource with very basic text editors or terminal settings, leading the users to create the input set on their own workstation and constantly copying it over. Only the actual running of the simulation can be tightly linked to the clustered file system... > The systems administrator must put in place strong policies on this - > leave your data on the fast storage, it gets deleted after N weeks. I see duplication of data in almost all cases as a human behaviour problem, not a technical one, which needs human behaviour solutions and not technical ones, so policies are a good solution. But I would argue that users' education is an even better solution - teach them why copying of data is bad and give them easy ways of safely sharing data with their collaborators and not only they will keep the file systems emptier but they will also thank you for the decrease in effort needed to manage the always increasing amounts of data. Such a solution is however only feasible for smaller groups - f.e. 
an HPC center offering services to several (many) universities won't be able to convince all its users to take time and think about data management and a virtual sucker rod is not as efficient as a real one, so policies would still be required... -- Bogdan Costescu IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany Phone: +49 6221 54 8240, Fax: +49 6221 54 8850 E-mail: bogdan.costescu at iwr.uni-heidelberg.de From niftyompi at niftyegg.com Thu Jun 4 11:53:47 2009 From: niftyompi at niftyegg.com (Nifty Tom Mitchell) Date: Thu, 4 Jun 2009 11:53:47 -0700 Subject: [Beowulf] dedupe filesystem In-Reply-To: References: <9f8092cc0906021018n523a7081r744e0847691300fe@mail.gmail.com> <9f8092cc0906030310j67d09343y3eab84a06c367a65@mail.gmail.com> Message-ID: <20090604185347.GA3088@tosh2egg.wr.niftyegg.com> On Thu, Jun 04, 2009 at 05:33:32AM +0200, Bogdan Costescu wrote: > On Wed, 3 Jun 2009, John Hearns wrote: > >> Quite often the architecture of storage is a secondary consideration, >> in the rush to get a Shiny New Fast machine on site and working. > > Well, I've seen it ignored even outside of that rush - in the design > phase. And I confess of being guilty of doing this as well, but I learn > from mistakes :-) > ..... > > I see duplication of data in oalmost all cases as a human behaviour > problem, not a technical one, which needs human behaviour solutions and > not technical ones, so policies are a good solution. Take this list for example. We each get our own copy and at times get multiple copies as a side effect of replies. One key here is the lack of 'caching' tools for mail and for HPC in the I/O filesystem space. There are multiple issues that make this hard, some technical some social, some habititual. On the habitual side, I was recently looking at a CS homework assignment and noticed that the primary instruction began with "copy" both code and "data" and then ended with "copy code and data" to submit the homework assignment result. The low budget answer today is a human behaviour solution... longer term solutions will need to understand the "data flow" and "data state" of a lot of replicated things (example mail and attachments) a lot better including the "off line" state, multiple keyboards (home/ work) and connectivity and connectivity quality state. It is possible that HPC tools and mail could evolve toward a Mecurial view (revision control) of data. This in turn implies a longer reach for access control and access policy tools. -- T o m M i t c h e l l Found me a new hat, now what? From kilian.cavalotti.work at gmail.com Fri Jun 5 05:29:50 2009 From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI) Date: Fri, 5 Jun 2009 14:29:50 +0200 Subject: [Beowulf] dedupe filesystem In-Reply-To: References: <9f8092cc0906030310j67d09343y3eab84a06c367a65@mail.gmail.com> Message-ID: <200906051429.51888.kilian.cavalotti.work@gmail.com> On Wednesday 03 June 2009 14:55:52 Michael Di Domenico wrote: > Do you find such a policy hard to enforce with researchers? I don't > have tiered storage today, but in the future i can see a need to have > a storage pool with SATA and a storage pool with SAS or faster drives > in it. Some of the researchers where I am, work on data for months. > Is this something better solved with pre/post-amble copies or through > policies? The best of both worlds would certainly be a central, fast storage filesystem, coupled with a hierarchical storage management system. Oh wait, it might exist already... 
Well, at least it's in the works: Sun and CEA are working on implementing such an HSM for Lustre 2.0. See http://wiki.lustre.org/images/8/8b/AurelienDegremont.pdf for details. Cheers, -- Kilian From kilian.cavalotti.work at gmail.com Fri Jun 5 05:39:24 2009 From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI) Date: Fri, 5 Jun 2009 14:39:24 +0200 Subject: [Beowulf] Re: dedupe Filesystem In-Reply-To: <20D2BC0D-6686-4AE6-BBFA-28E49AC13686@serissa.com> References: <20D2BC0D-6686-4AE6-BBFA-28E49AC13686@serissa.com> Message-ID: <200906051439.24812.kilian.cavalotti.work@gmail.com> On Wednesday 03 June 2009 14:24:19 Lawrence Stewart wrote: > I probably don't need to remind anyone here that deduplication on a > live filesystem (as opposed to backups) can have really bad > performance effects. Imagine if you have to move the disk arms around > for every file for every block of every file. Modern filesystems do > well at keeping files contiguous and often keep all the files of a > directory nearly. This locality gets trashed by deduplicaiton. This > won't matter if the problem is making backups smaller or making > incrementals run faster, but it is not good for the performance of a > live filesystem. It's kind of new to me, but it looks like some vendors have block-level deduplication systems integrated in some of their product lines. See http://www.theregister.co.uk/2009/06/05/dell_block_dedupe/ or http://www.theregister.co.uk/2009/06/01/quantum_extends_dxi/ > -Larry/thinking about what to do next :( Cheers, -- Kilian From hearnsj at googlemail.com Fri Jun 5 06:32:04 2009 From: hearnsj at googlemail.com (John Hearns) Date: Fri, 5 Jun 2009 14:32:04 +0100 Subject: [Beowulf] dedupe filesystem In-Reply-To: <200906051429.51888.kilian.cavalotti.work@gmail.com> References: <9f8092cc0906030310j67d09343y3eab84a06c367a65@mail.gmail.com> <200906051429.51888.kilian.cavalotti.work@gmail.com> Message-ID: <9f8092cc0906050632s6a356bf9re5a9d2384aed55ef@mail.gmail.com> 2009/6/5 Kilian CAVALOTTI : >> > Oh wait, it might exist already... Well, at least it's in the works: Sun and > CEA are working on implementing such an HSM for Lustre 2.0. See > http://wiki.lustre.org/images/8/8b/AurelienDegremont.pdf for details. > That looks interesting, thankyou. The Robinhood tool for looking at filesystem usage and automatic purging looks good to me. From hahn at mcmaster.ca Fri Jun 5 06:52:55 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 5 Jun 2009 09:52:55 -0400 (EDT) Subject: [Beowulf] dedupe filesystem In-Reply-To: <200906051429.51888.kilian.cavalotti.work@gmail.com> References: <9f8092cc0906030310j67d09343y3eab84a06c367a65@mail.gmail.com> <200906051429.51888.kilian.cavalotti.work@gmail.com> Message-ID: >> have tiered storage today, but in the future i can see a need to have >> a storage pool with SATA and a storage pool with SAS or faster drives >> in it. IMO, this is a dubious assertion. I bought a couple incredibly cheap desktop disks for home use a couple weeks ago: just seagate 7200.12's. these are of the latest 500G/platter generation, so have the high density and thus bandwidth: http://www.sharcnet.ca/~hahn/7200.12.png sure, your application may require low-latency. but bandwidth is easy. >> Some of the researchers where I am, work on data for months. my organization's current policy is to be fairly stingy with /home and /work, neither of which have any timeouts. /scratch currently has a 1-month timeout, which unfortunately tends to be too short to encourage use. 
>> Is this something better solved with pre/post-amble copies or through >> policies? we currently have a periodic crawler that collects data on each filesystem: hashing each file to avoid people gaming timeouts with touch. > The best of both worlds would certainly be a central, fast storage filesystem, > coupled with a hierarchical storage management system. I'm not sure - is there some clear indication that one level of storage is not good enough? > Oh wait, it might exist already... Well, at least it's in the works: Sun and > CEA are working on implementing such an HSM for Lustre 2.0. See > http://wiki.lustre.org/images/8/8b/AurelienDegremont.pdf for details. this seems like a bad design to me. I would think (and I'm reasonably familiar with Lustre, though not an internals expert) that if you're going to touch Lustre interfaces at all, you should simply add cheaper, higher-density OSTs, and make more intelligent placement/migration heuristics. I guess that CEA already has a vast investment in some existing HSM, so can't do this. regards, mark hahn From kilian.cavalotti.work at gmail.com Fri Jun 5 07:09:54 2009 From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI) Date: Fri, 5 Jun 2009 16:09:54 +0200 Subject: [Beowulf] dedupe filesystem In-Reply-To: References: <200906051429.51888.kilian.cavalotti.work@gmail.com> Message-ID: <200906051609.54829.kilian.cavalotti.work@gmail.com> On Friday 05 June 2009 15:52:55 Mark Hahn wrote: > > The best of both worlds would certainly be a central, fast storage > > filesystem, coupled with a hierarchical storage management system. > > I'm not sure - is there some clear indication that one level of storage is > not good enough? I guess it strongly depends on your workload and applications. If your users tend to keep all their files for long-term purposes, as Bogdan Costescu pertinently described earlier, it might make sense to transparently free up the fast centralized filesystem and move the unused-at-the-moment-but-still- crucially-important files to a slower, farther filesystem (or tapes). This way, you have more fast storage space available for running jobs, while keeping the convenience for users to still be able to access their archived files transparently, as if they still were on the filesystem. It's a nice feature to have because it makes users life easier. Obviously, if you don't already have this kind of second level storage infrastructure, the benefit is maybe not worth the investment. > this seems like a bad design to me. I would think (and I'm reasonably > familiar with Lustre, though not an internals expert) that if you're going > to touch Lustre interfaces at all, you should simply add cheaper, > higher-density OSTs, and make more intelligent placement/migration > heuristics. In Lustre, that would be done through OST pools. Eh, isn't this also a feature CEA contributed to? :) Cheers, -- Kilian From hearnsj at googlemail.com Fri Jun 5 07:21:59 2009 From: hearnsj at googlemail.com (John Hearns) Date: Fri, 5 Jun 2009 15:21:59 +0100 Subject: [Beowulf] dedupe filesystem In-Reply-To: References: <9f8092cc0906030310j67d09343y3eab84a06c367a65@mail.gmail.com> <200906051429.51888.kilian.cavalotti.work@gmail.com> Message-ID: <9f8092cc0906050721s2aabe904m4bb1514941bbe9b6@mail.gmail.com> 2009/6/5 Mark Hahn : > I'm not sure - is there some clear indication that one level of storage is > not good enough? That is well worthy of a debate. As the list knows, I am a fan of HSMs - for the very good reason of having good experience with them. 
There are still arguments made that 'front line tier = fast SCSI/fibrechannel disk' 'second line and lower tier = SATA' and the sales types say SATA is slower and less reliable. Mark, you make the very good point that the world is changing (or indeed has changed) and you should be looking at an infinitely expandable disk based setup - just add more disks into the slot, more JBODs, whatever. Actually, as a complete aside here I have been looking at Virtual Tape Libraries. One of the Spectralogic models actually eats SATA drives just like they are tapes - I'm now going to counter your argument - let's say we have an expensive parallel filestore such as Panasas. Or maybe Lustre. So your researchers work on a new project, and need new storage. But they have old projects lying around. They argue they might revisit them, they might need this data, someone might take on a PhD student to trawl through it, or you are in the movie business and your movie has premiered yet there is a director's cut scheduled for next year... OK, so you can add more Panasas. Cue salesman buying in a large bucket of glee to rub his hands in. I agree my argument holds less water with Lustre. From hahn at mcmaster.ca Fri Jun 5 07:22:55 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 5 Jun 2009 10:22:55 -0400 (EDT) Subject: [Beowulf] dedupe filesystem In-Reply-To: <200906051609.54829.kilian.cavalotti.work@gmail.com> References: <200906051429.51888.kilian.cavalotti.work@gmail.com> <200906051609.54829.kilian.cavalotti.work@gmail.com> Message-ID: >>> The best of both worlds would certainly be a central, fast storage >>> filesystem, coupled with a hierarchical storage management system.
>> >> I'm not sure - is there some clear indication that one level of storage is >> not good enough? > > I guess it strongly depends on your workload and applications. If your users > tend to keep all their files for long-term purposes, as Bogdan Costescu > pertinently described earlier, it might make sense to transparently free up > the fast centralized filesystem and move the unused-at-the-moment-but-still- > crucially-important files to a slower, farther filesystem (or tapes). no: my point concerned only whether there is a meaningful distinction between what you call fast and slow storage. from a hardware perspective, there can be differences in latency, though probably not noticed by anything other than benchmarks or quite specialized apps. there can also be aggregate throughput differences: an OST on QDR IB vs Gb. so, an open question: do HPC people still buy high-end disks (small FC/SAS disks, usually also 10-15k)? we're putting them in our next MDS, but certainly not anywhere else. another question: do HPC people still buy tapes? I'm personally opposed, but some apparently reasonable people find them comforting (oddly, because they can to some limited extent be taken offline.) >> this seems like a bad design to me. I would think (and I'm reasonably >> familiar with Lustre, though not an internals expert) that if you're going >> to touch Lustre interfaces at all, you should simply add cheaper, >> higher-density OSTs, and make more intelligent placement/migration >> heuristics. > > In Lustre, that would be done through OST pools. Eh, isn't this also a feature > CEA contributed to? :) is there a documented, programmatic lustre interface to control OST placement (other than lfs setstripe)? such an interface would also need a high-performance way to query the MDS directly (not through the mounted FS, which is too slow for any seriously large FS.) From hearnsj at googlemail.com Fri Jun 5 07:24:51 2009 From: hearnsj at googlemail.com (John Hearns) Date: Fri, 5 Jun 2009 15:24:51 +0100 Subject: [Beowulf] dedupe filesystem In-Reply-To: <9f8092cc0906050723g1b395728wd1c5d0a61e11c896@mail.gmail.com> References: <9f8092cc0906030310j67d09343y3eab84a06c367a65@mail.gmail.com> <200906051429.51888.kilian.cavalotti.work@gmail.com> <9f8092cc0906050723g1b395728wd1c5d0a61e11c896@mail.gmail.com> Message-ID: <9f8092cc0906050724r4aaef891x248e5fd76f78ad48@mail.gmail.com> ps. The Robinhood file scanning utility which the Lustre HSM project intends to use looked good to me. I downloaded it and tried to compile it up - it claims to compile without having all the Lustre libraries on the system, but did not. Grrrrrr.... From eugen at leitl.org Fri Jun 5 07:32:37 2009 From: eugen at leitl.org (Eugen Leitl) Date: Fri, 5 Jun 2009 16:32:37 +0200 Subject: [Beowulf] dedupe filesystem In-Reply-To: References: <9f8092cc0906030310j67d09343y3eab84a06c367a65@mail.gmail.com> <200906051429.51888.kilian.cavalotti.work@gmail.com> Message-ID: <20090605143237.GX23524@leitl.org> On Fri, Jun 05, 2009 at 09:52:55AM -0400, Mark Hahn wrote: > IMO, this is a dubious assertion. I bought a couple incredibly cheap > desktop disks for home use a couple weeks ago: just seagate 7200.12's. Are you happy with the 7200.12 so far? I must admit the awful 7200.11 (the 1 and 1.5 TByte variety) has quite soured me on Seagate. I haven't had any problems with WD so far (I'm strictly using RE3 and RE4's, though).
> these are of the latest 500G/platter generation, so have the high density > and thus bandwidth: > http://www.sharcnet.ca/~hahn/7200.12.png > > sure, your application may require low-latency. but bandwidth is easy. -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE From laytonjb at att.net Fri Jun 5 07:33:59 2009 From: laytonjb at att.net (Jeff Layton) Date: Fri, 05 Jun 2009 10:33:59 -0400 Subject: [Beowulf] dedupe filesystem In-Reply-To: <9f8092cc0906050724r4aaef891x248e5fd76f78ad48@mail.gmail.com> References: <9f8092cc0906030310j67d09343y3eab84a06c367a65@mail.gmail.com> <200906051429.51888.kilian.cavalotti.work@gmail.com> <9f8092cc0906050723g1b395728wd1c5d0a61e11c896@mail.gmail.com> <9f8092cc0906050724r4aaef891x248e5fd76f78ad48@mail.gmail.com> Message-ID: <4A292CD7.90009@att.net> John Hearns wrote: > ps. The Robinhood file scanning utility which the Lustre HSM project > intends to use looked good to me. > I downloaded it and tried to compile it up - it claims to compile > without having all the Lustre libraries on the system, > but did not. Grrrrrr... > AFAIK it's Lustre specific :) From tortay at cc.in2p3.fr Fri Jun 5 07:51:28 2009 From: tortay at cc.in2p3.fr (Loic Tortay) Date: Fri, 05 Jun 2009 16:51:28 +0200 Subject: [Beowulf] dedupe filesystem In-Reply-To: References: <9f8092cc0906030310j67d09343y3eab84a06c367a65@mail.gmail.com> <200906051429.51888.kilian.cavalotti.work@gmail.com> Message-ID: <4A2930F0.404@cc.in2p3.fr> Mark Hahn wrote: [...] > > this seems like a bad design to me. I would think (and I'm reasonably > familiar with Lustre, though not an internals expert) that if you're > going to touch Lustre interfaces at all, you should simply add cheaper, > higher-density > OSTs, and make more intelligent placement/migration heuristics. I guess > that CEA already has a vast investment in some existing HSM, so can't do > this. > Last time I talked with one of the people in charge of this at CEA, they had something like 5 petabytes of Lustre based disk storage and, if I'm not mistaken, a bit more in their HSM (HPSS from IBM). That was about 18 months ago, I'm pretty sure they now have much more Lustre storage. Similar HSM interfaces for other cluster filesystems have been available for some time. For instance, QFS w/ SAM-FS, GPFS w/ HPSS (in several forms), GPFS w/ "external pools" and probably others (like StorNext). Regarding your question about tapes, these are still widely used in HEP data processing where data volumes are in the petabytes scale for the older experiments and will be, at least, in the tens of petabytes scale for the LHC experiments at CERN. Loïc. -- | Loïc Tortay - IN2P3 Computing Centre | From landman at scalableinformatics.com Fri Jun 5 08:00:09 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Fri, 05 Jun 2009 11:00:09 -0400 Subject: [Beowulf] dedupe filesystem In-Reply-To: <9f8092cc0906050721s2aabe904m4bb1514941bbe9b6@mail.gmail.com> References: <9f8092cc0906030310j67d09343y3eab84a06c367a65@mail.gmail.com> <200906051429.51888.kilian.cavalotti.work@gmail.com> <9f8092cc0906050721s2aabe904m4bb1514941bbe9b6@mail.gmail.com> Message-ID: <4A2932F9.1010304@scalableinformatics.com> John Hearns wrote: > 2009/6/5 Mark Hahn : >> I'm not sure - is there some clear indication that one level of storage is >> not good enough?
I hope I pointed this out before, but Dedup is all about reducing the need for the less expensive 'tier'. Tiered storage has some merits, especially in the 'infinite size' storage realm. Take some things offline, leave things you need online until they go dormant. Define dormant on your own terms. > That is well worthy of a debate. Tiered makes sense in the sense of HSMs. Not so much (for HPC ... and increasingly for business). > > As the list knows, I am a fan of HSMs - for the very good reason of > having good experience with them. > > There are still arguments made that 'front line tier = fast > SCSI/fibrechannel disk' 'second line and lwoer tier = SATA' and the > sales > types say SATA is slower and less reliable. These arguments are still being made by many folks with vested interests in the expensive FC solutions. This is where Dedup plays. Reduce the need for the second tier, and you will get less pressure to drop your prices. The added benefit is that backups should take less time, DR can take less time. And fewer of those meddling smaller storage vendors with big and honking fast disks need be around their turf ... > Mark, you make the very good point that the world is changing (or > indeed has changed) and you should be looking at an infinitely > expandable disk based setup - just add more disks into the slot, more > JBODs, whatever. Yeah, change happens. Those who resist inevitable change will be on the dustheap in short order, if their business model/requirements don't adapt. The worlds largest data repository doesn't do dedup, or use 'tiered' storage. Rather they embrace duplication (n-plication actually), and 'flat' single tiers. There is a reason for this, and it is driven by cost and performance. > Actually, as a complete aside here I have been lookign at Virtual Tape > Libraries. One of the Spectralogic models actually eats SATA drives > just liek they are tapes - I had read about this unit. Will have to speak with them at some point :) > > I'm now going to counter your argument - let's say we have an > expensive parallel filestore such as Panasas. Or maybe Lustre. > So your researchers work on a new project, and need new storage. But > they have old projects lying around. > They argue they might revisit them, they might need this data, someone > might take on a PhD student to trawl through it, Yeah ... when I started in my studies, I reworked older calcs, and revisited 1-5 year old data. Having easy access to this data is important. Access should be transparent. Don't mind waiting a short while (seconds) for the initial access, but subsequent needs to be fast. > or you are in the movie business and your movie has premiered yet > there is a directors cut scheduled for next year... That is one of the markets we are looking at. > OK, so you can add more Panasas. Cue salesman buying in a large bucket > of glee to rub his hands in. > I agree may argument holds less water with Lustre. Look at it this way ... what is the cost/benefit to the movie-company to buy/build expensive storage and build tiers, as compared to much less expensve replicated/HSMed storage? I think the writing is clearly on the wall on this. Lots of the folks in this industry will disagree, but follow what the customers are actually doing. Joe ps: [commmercial content] We have a stake in this stuff, so standard bias disclaimers apply to my post. See today's InsideHPC for more ... 
http://insidehpc.com [/commercial content] -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From hearnsj at googlemail.com Fri Jun 5 08:12:42 2009 From: hearnsj at googlemail.com (John Hearns) Date: Fri, 5 Jun 2009 16:12:42 +0100 Subject: [Beowulf] dedupe filesystem In-Reply-To: <4A2932F9.1010304@scalableinformatics.com> References: <9f8092cc0906030310j67d09343y3eab84a06c367a65@mail.gmail.com> <200906051429.51888.kilian.cavalotti.work@gmail.com> <9f8092cc0906050721s2aabe904m4bb1514941bbe9b6@mail.gmail.com> <4A2932F9.1010304@scalableinformatics.com> Message-ID: <9f8092cc0906050812l83ef801ic7d4568c3e3a153b@mail.gmail.com> 2009/6/5 Joe Landman : > > > Look at it this way ... what is the cost/benefit to the movie-company to > buy/build expensive storage and build tiers, as compared to much less > expensive replicated/HSMed storage? I think the writing is clearly on the > wall on this. Lots of the folks in this industry will disagree, but follow > what the customers are actually doing. > There is a science fiction novel which describes how women will live forever. As women become older, their life expectancy will increase as new and expensive treatments become available to medical science to extend their lifetime. At the point where the rate of increase becomes more than one year added per year, you are effectively immortal. Of course only women, being wise enough to invest their money in compound interest bearing schemes, will benefit from this. Men, who smoke and drink at e.g. the LECBIG, will die too early for their money to pay for extended treatments. So what we really want is a storage system that will swallow up drives as they get bigger and bigger - so as your researchers create more and more data, or stream in more and more satellite/accelerator data/logs of phone calls (a la GCHQ) then your storage system is expanding at a faster rate. From landman at scalableinformatics.com Fri Jun 5 08:21:07 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Fri, 05 Jun 2009 11:21:07 -0400 Subject: [Beowulf] dedupe filesystem In-Reply-To: <9f8092cc0906050812l83ef801ic7d4568c3e3a153b@mail.gmail.com> References: <9f8092cc0906030310j67d09343y3eab84a06c367a65@mail.gmail.com> <200906051429.51888.kilian.cavalotti.work@gmail.com> <9f8092cc0906050721s2aabe904m4bb1514941bbe9b6@mail.gmail.com> <4A2932F9.1010304@scalableinformatics.com> <9f8092cc0906050812l83ef801ic7d4568c3e3a153b@mail.gmail.com> Message-ID: <4A2937E3.6080503@scalableinformatics.com> John Hearns wrote: > There is a science fiction novel which describes how women will live > forever. As women become older, their life expectancy will increase as > new and expensive treatments become available to medical science to > extend their lifetime. At the point where the rate of increase becomes > more than one year added per year, you are effectively immortal. Of > course only women, being wise enough to invest their money in compound > interest bearing schemes, will benefit from this. Men, who smoke and > drink at e.g. the LECBIG, will die too early for their money to pay for > extended treatments.
> > So what we really want is a storage system that will swallow up drives > as they get bigger and bigger - so as your researchers create more and > more data, or stream in more and more satellite/accelerator data/logs > of phone calls (a la GCHQ) then your storage system is expanding at a > faster rate. Mmmm.... tasty bits .... mmmm -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From james.p.lux at jpl.nasa.gov Fri Jun 5 09:55:35 2009 From: james.p.lux at jpl.nasa.gov (Lux, James P) Date: Fri, 5 Jun 2009 09:55:35 -0700 Subject: [Beowulf] dedupe filesystem In-Reply-To: <4A2932F9.1010304@scalableinformatics.com> Message-ID: Isn't de-dupe just another flavor, conceptually, of a journaling file system.. in the sense that in many systems, only a small part of the file actually changes each time, so saving "diffs" allows one to reconstruct any arbitrary version with much smaller file space. I guess the de-dupe is a bit more aggressive than that, in that it theoretically can look for common "stuff" between unrelated files, so maybe a better model is a "data compression" algorithm on the fly. And for that, it's all about trading between cost of storage space, retrieval time, and computational effort to run the algorithm. (Reliability factors into it a bit.. Compression removes redundancy, after all, but the de facto redundancy provided by having previous versions around isn't a good "system" solution, even if it's the one people use.) I think one can make the argument that computation is always getting cheaper, at a faster rate than storage density or speed (because of the physics limits on the storage...), so the "span" over which you can do compression can be arbitrarily increased over time. TIFF and FAX do compression over a few bits. Zip and its ilk do compression over kilobits or megabits (depending on whether they build a custom symbol table). Dedupe is doing compression over Gigabits and terabits, presumably (although I assume that there's a granularity at some point.. A dedupe system looks at symbols that are, say, 512 bytes long, as opposed to ZIP looking at 8-bit symbols, or Group 4 Fax looking at 1-bit symbols.) The hierarchical storage is really optimizing along a different axis than compression. It's more like cache than compression.. Make the "average time to get to the next bit you need" smaller rather than "make smaller number of bits". Granted, for a lot of systems, "time to get a bit" is proportional to "number of bits". On 6/5/09 8:00 AM, "Joe Landman" wrote: John Hearns wrote: > 2009/6/5 Mark Hahn : >> I'm not sure - is there some clear indication that one level of storage is >> not good enough? I hope I pointed this out before, but Dedup is all about reducing the need for the less expensive 'tier'. Tiered storage has some merits, especially in the 'infinite size' storage realm. Take some things offline, leave things you need online until they go dormant. Define dormant on your own terms.
URL: From james.p.lux at jpl.nasa.gov Fri Jun 5 10:04:55 2009 From: james.p.lux at jpl.nasa.gov (Lux, James P) Date: Fri, 5 Jun 2009 10:04:55 -0700 Subject: [Beowulf] dedupe filesystem In-Reply-To: <9f8092cc0906050812l83ef801ic7d4568c3e3a153b@mail.gmail.com> Message-ID: On 6/5 So what we really want is a storage system that will swallow up drives as they get bigger and bigger - so as your researchers create more and more data, or stream in more and more satellite/accelerator data/logs of phone calls (a la GCHQ) then your storage system is expanding at a faster rate. --- Many years ago I read an interesting paper talking about how modern user interfaces are hobbled by assumptions incorporated decades ago. When disk space is slow and precious, having users decide to explicitly save their file while editing is a good idea. (don't even contemplate casette tape on microcomputers..). Now, though, disk is cheap and fast and so are processors, so there's really no reason why you shouldn't store your word processing as a chain of keystrokes, with infinite undo available. Say I spent 8 hours a day doing nothing but typing at 100wpm.. That's 480 minutes * 500 characters/minute.. Call it a measly 250,000 bytes per day. Heck, the 2GB of RAM in the macbook I'm typing this on would hold 8000 days of work. In reality, a few GB would probably hold more characters than I will type in my entire life (or mouse clicks, etc.) In theory, then, with sufficient computational power (and that's what this list is all about) with the data on a small thumb drive I should be able to reconstruct everything, in every version, I've ever created or will create. All it takes is a sufficiently powerful "rendering engine" I readily concede that much data that is stored on computers is NOT the direct result of someone typing. Imagery is probably the best example of huge data that isn't suitable for the "base version + all diffs" model. -------------- next part -------------- An HTML attachment was scrubbed... URL: From landman at scalableinformatics.com Fri Jun 5 10:12:39 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Fri, 05 Jun 2009 13:12:39 -0400 Subject: [Beowulf] dedupe filesystem In-Reply-To: References: Message-ID: <4A295207.5070004@scalableinformatics.com> Lux, James P wrote: > Isn?t de-dupe just another flavor, conceptually, of a journaling file > system..in the sense that in many systems, only a small part of the file > actually changes each time, so saving ?diffs? allows one to reconstruct > any arbitrary version with much smaller file space. Its really more conceptually like RLE (run length encoding) or simple compression where you start with a pattern and a dictionary, and point out where in the file that pattern repeats. > I guess the de-dupe is a bit more aggressive than that, in that it > theoretically can look for common ?stuff? between unrelated files, so It only looks at raw blocks. If they have the same hash signatures (think like MD5 or SHA ... hopefully with fewer collisions), then they are duplicates. > maybe a better model is a ?data compression? algorithm on the fly. Yup this is it, but on the fly is the hard part. Doing this comparison is computationally very expensive. The hash calculations are not cheap by any measure. You most decidedly do not wish to do this on the fly ... > And for that, it?s all about trading between cost of storage space, > retrieval time, and computational effort to run the algorithm. Exactly. > (Reliability factors into it a bit.. 
Compression removes redundancy, > after all, but the defacto redundancy provided by having previous > versions around isn't a good "system" solution, even if it's the one > people use)

:) You get a direct CBA comparison between buying the N+1th disk, and the time/effort/money to perform this computation. In the end, the former wins.

> I think one can make the argument that computation is always getting > cheaper, at a faster rate than storage density or speed (because of the > physics limits on the storage...), so the "span" over which you can do > compression can be arbitrarily increased over time. TIFF and FAX do > compression over a few bits. Zip and it's ilk do compression over > kilobits or megabits (depending on whether they build a custom symbol > table). Dedupe is doing compression over Gigabits and terabits, > presumably (although I assume that there's a granularity at some point.. > A dedupe system looks at symbols that are, say, 512 bytes long, as > opposed to ZIP looking at 8bit symbols, or Group4 Fax looking at 1 bit > symbols.

Most Dedup is over blocks, and I think most are doing 512 bytes or 4k bytes. The point is that even if computation is theoretically getting cheaper, the calculation of the hashes (the ones without collisions, as collisional hashes are ... um ... not good for Dedup) is still a significant bit of computation. One well suited for an accelerator. Which is why the Dedup market seems to be "flooded" with accelerators (which I think are little more than FPGAs implementing some hash computation algorithm).

> The hierarchical storage is really optimizing along a different axis > than compression. It's more like cache than compression.. Make the > "average time to get to the next bit you need" smaller rather than "make > smaller number of bits"

Basically yes ... though HSM is all about driving down the cost of the large pool as low as possible. Tape is still used, and lots of people make arguments for tape. But as John pointed out, Spectra Logic is marketing a SATA eating robot, so I think the days of tape are likely more numbered than before.

A brief anecdote. In 1989, a fellow graduate student was leaving for another school. He was taking his data with him. He spooled up a Vax 8650 unit with a tape. I asked him why this over other media. His response was, you can read a Vax tape anywhere.

In 2009, twenty years later, I think he might have a different take on this. I put all my bits onto floppies when I left there, and moved the important ones to spinning rust. I can still read the floppies. I doubt he can still read the tapes.

The point is that tape folks talk about longevity. But this makes a number of important assumptions about the media, the drives, and availability of replacement drives, which, as my advisor in graduate school discovered after her drive died, are not necessarily correct or accurate.

> Granted, for a lot of systems, "time to get a bit" is proportional to "number of bits"

Yup. But that initial latency can be huge. While the cost of computation is decreasing rapidly, I'll argue that the cost of storage is decreasing as fast if not faster. This has implications for which mode is preferable ... n-plication onto decreasing cost media, or computation to minimize the cost of the ... cheap media footprint. The CBA doesn't favor Dedup in the long term, though it does favor HSM ... even cloud storage. The issues there are bandwidth, bandwidth, and, you guessed it, bandwidth.
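Coming back to the hash cost point: if you want to put your own number on single-core hash throughput versus disk bandwidth, a rough check is easy to do. The sketch below uses Python's hashlib purely for convenience; the buffer size and the list of digests are arbitrary choices, and the numbers it prints are whatever your machine gives, not figures from this thread.

# Rough single-core hash throughput check.  hashlib wraps the same OpenSSL
# digests that sha1sum/md5sum use, so this is in the same ballpark as timing
# those tools on cached data.
import hashlib
import time

BUF = b"\0" * (64 * 1024 * 1024)   # 64 MB buffer, already in memory

for name in ("md5", "sha1", "sha256"):
    h = hashlib.new(name)
    t0 = time.time()
    h.update(BUF)
    h.hexdigest()
    dt = time.time() - t0
    print("%-7s %7.1f MB/s" % (name, len(BUF) / (1024.0 * 1024.0) / dt))

Compare whatever that prints against the streaming rate of your disks or RAID to see how many cores a dedup engine would have to burn just on hashing.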
> > On 6/5/09 8:00 AM, "Joe Landman" wrote:
> > John Hearns wrote:
> > 2009/6/5 Mark Hahn :
> > > I'm not sure - is there some clear indication that one level of > storage is > > > not good enough?
> > I hope I pointed this out before, but Dedup is all about reducing the > need for the less expensive 'tier'. Tiered storage has some merits, > especially in the 'infinite size' storage realm. Take some things > offline, leave things you need online until they go dormant. Define > dormant on your own terms.
> -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615

From hearnsj at googlemail.com Fri Jun 5 10:18:13 2009
From: hearnsj at googlemail.com (John Hearns)
Date: Fri, 5 Jun 2009 18:18:13 +0100
Subject: [Beowulf] dedupe filesystem
In-Reply-To:
References: <9f8092cc0906050812l83ef801ic7d4568c3e3a153b@mail.gmail.com>
Message-ID: <9f8092cc0906051018u745e93f5lca139ed5c028a1e6@mail.gmail.com>

2009/6/5 Lux, James P :
>> > In theory, then, with sufficient computational power (and that's what this > list is all about) with the data on a small thumb drive I should be able to > reconstruct everything, in every version, I've ever created or will create. > All it takes is a sufficiently powerful "rendering engine"
>
If I am not wrong, you have been reading "Godel, Escher, Bach: An Eternal Golden Braid"?
In which case you will next say that all of these keystrokes can be encoded as one unique prime number, which is very easily stored on a very small thumb drive.

From hearnsj at googlemail.com Fri Jun 5 10:25:32 2009
From: hearnsj at googlemail.com (John Hearns)
Date: Fri, 5 Jun 2009 18:25:32 +0100
Subject: [Beowulf] dedupe filesystem
In-Reply-To: <4A295207.5070004@scalableinformatics.com>
References: <4A295207.5070004@scalableinformatics.com>
Message-ID: <9f8092cc0906051025m24b4c436wc18492d9a6770e6f@mail.gmail.com>

2009/6/5 Joe Landman :
> > A brief anecdote. In 1989, a fellow graduate student was leaving for > another school. He was taking his data with him. He spooled up a Vax 8650 > unit with a tape. I asked him why this over other media. His response was, > you can read a Vax tape anywhere.
> > In 2009, twenty years later, I think he might have a different take on this. > I put all my bits onto floppys when I left there, and moved the important > ones to spinning rust. I can still read the floppies. I doubt he can still > read the tapes.

This is I think referred to as 'digital archaeology'

> The point is that tape folks talk about longevity. But this makes a number > of important assumptions about the media, the drives, and availability of > replacement drives, which, as my advisor in graduate school discovered after > her drive died, are not necessarily correct or accurate.
>
There. You have the concept - now, to add value to my SATA eating expanding storage array, you need to engineer it so your company can come along and bolt onto it the next type of storage - cakes of Blu-ray disks, multi packs of thumb drives, or whatever. The smart storage array will already be migrating your data before you even know it is out of date. The hard part comes in disguising the bills to the Chief Finance Officer.
From landman at scalableinformatics.com Fri Jun 5 10:43:27 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Fri, 05 Jun 2009 13:43:27 -0400 Subject: [Beowulf] dedupe filesystem In-Reply-To: <9f8092cc0906051025m24b4c436wc18492d9a6770e6f@mail.gmail.com> References: <4A295207.5070004@scalableinformatics.com> <9f8092cc0906051025m24b4c436wc18492d9a6770e6f@mail.gmail.com> Message-ID: <4A29593F.4050405@scalableinformatics.com> John Hearns wrote: >> In 2009, twenty years later, I think he might have a different take on this. >> I put all my bits onto floppys when I left there, and moved the important >> ones to spinning rust. I can still read the floppies. I doubt he can still >> read the tapes. > > This is I think referred to as 'digital archaeology' :) (visions of scientists holding up some rust as they dig through a plastic tape, and saying "look, I found a bit!!!") >> The point is that tape folks talk about longevity. But this makes a number >> of important assumptions about the media, the drives, and availability of >> replacement drives, which, as my advisor in graduate school discovered after >> her drive died, are not necessarily correct or accurate. >> > > There. You have the concept - now, to add value to my SATA eating > expanding storage array, you need to engineer it > so your company can come along and bolt onto it the next type of > storage - cakes of Blu-ray disks, multi packs of thumb drives, or > whatever. The smart storage array will already be migrating your data > before you even know it is out of date. > The hard part comes in disguising the bills to the Chief Finance Officer. Actually, what you described is *exactly* cloud storage. And the CFO would love (generally) to pay for it. Add whatever capacity you need, and pay for it ... only when you need it. Lowers the cost per TB or per GB ... however you want to view it. Your cost to run 1TB includes power, cooling, space, etc. Your cost to increment this costs whatever quantum of storage you currently pay in whatever size you pay for it. What if, rather than in large "kerchunk" amounts (with gleeful sales critters rubbing hands together), it was in effectively whatever size amount you needed? Without turning this into a commercial, we are working with a few folks in this regime. Anyone interested in this stuff, bug me offline. Do remember, TANSTAAFL though ... you have to pay the storage loaded cost, the bandwidth costs and latency to get data. I have a feeling, if the governments really invest in infrastructure that this might be much less of an issue going forward ... -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From james.p.lux at jpl.nasa.gov Fri Jun 5 10:51:00 2009 From: james.p.lux at jpl.nasa.gov (Lux, James P) Date: Fri, 5 Jun 2009 10:51:00 -0700 Subject: [Beowulf] dedupe filesystem In-Reply-To: <9f8092cc0906051018u745e93f5lca139ed5c028a1e6@mail.gmail.com> Message-ID: On 6/5/09 10:18 AM, "John Hearns" wrote: > 2009/6/5 Lux, James P : >>> >> In theory, then, with sufficient computational power (and that?s what this >> list is all about) ?with the data on a small thumb drive I should be able to >> reconstruct everything, ?in every version, I?ve ever created or will create. >> ?All it takes is a sufficiently powerful ?rendering engine? 
>> > If I am not wrong, you have been reading "Godel, Escher, Bach: An > Eternal Braid"? > In which case you will next say that all of these keystrokes can be > encoded as one unique prime number, which is very > easily stored on a very small thuimb drive. > Gosh. I haven't thought about GEB for decades, but I don't think you need to restrict it to primes. I just need MY unique number (and the cool thing is that I can retire, because knowing that, and having that rendering engine, I can render the future as well as the past.) Let's see... World population of a few billion, maybe add a few bits for redundancy/ECC.. Everyone needs a 40 bit or so number. Hmmm... Maybe that's why DES has 56 bit keys? Everyone has their personal DES key. Or is there some more mystical reason.. I must go find my copies of Umberto Eco.. From stewart at serissa.com Fri Jun 5 12:01:59 2009 From: stewart at serissa.com (Lawrence Stewart) Date: Fri, 5 Jun 2009 15:01:59 -0400 Subject: [Beowulf] dedupe filesystem In-Reply-To: References: Message-ID: On Jun 5, 2009, at 1:04 PM, Lux, James P wrote: > > --- > Many years ago I read an interesting paper talking about how modern > user interfaces are hobbled by assumptions incorporated decades > ago. When disk space is slow and precious, having users decide to > explicitly save their file while editing is a good idea. (don?t even > contemplate casette tape on microcomputers..). Now, though, disk is > cheap and fast and so are processors, so there?s really no reason > why you shouldn?t store your word processing as a chain of > keystrokes, with infinite undo available. Say I spent 8 hours a day > doing nothing but typing at 100wpm.. That?s 480 minutes * 500 > characters/minute.. Call it a measly 250,000 bytes per day. Heck, > the 2GB of RAM in the macbook I?m typing this on would hold 8000 > days of work. In reality, a few GB would probably hold more > characters than I will type in my entire life (or mouse clicks, etc.) That takes me back. The Cedar computing environment at Xerox PARC did this around 1981. Every input event, including mouse input and keyboard input, got a 48-bit timestamp, IIRC. This was done by Dan Swinehart, who is still there I think. The idea was to never get user events out of order or delivered to the wrong window due to UI slowness like moving windows. The text editors didn't lose your work either. This was on 4 MIPS (about) machines - the Dorado. They seemed fast to us: "it sucks the keystrokes right out of my fingers". -Larry -------------- next part -------------- An HTML attachment was scrubbed... URL: From stewart at serissa.com Fri Jun 5 12:09:40 2009 From: stewart at serissa.com (Lawrence Stewart) Date: Fri, 5 Jun 2009 15:09:40 -0400 Subject: [Beowulf] dedupe filesystem In-Reply-To: <4A295207.5070004@scalableinformatics.com> References: <4A295207.5070004@scalableinformatics.com> Message-ID: On Jun 5, 2009, at 1:12 PM, Joe Landman wrote: > Lux, James P wrote: > > It only looks at raw blocks. If they have the same hash signatures > (think like MD5 or SHA ... hopefully with fewer collisions), then > they are duplicates. > >> maybe a better model is a ?data compression? algorithm on the fly. > > Yup this is it, but on the fly is the hard part. Doing this > comparison is computationally very expensive. The hash calculations > are not cheap by any measure. You most decidedly do not wish to do > this on the fly ... > >> And for that, it?s all about trading between cost of storage space, >> retrieval time, and computational effort to run the algorithm. 
> > Exactly. I think the hash calculations are pretty cheap, actually. I just timed sha1sum on a 2.4 GHz core2 and it runs at 148 Megabytes per second, on one core (from the disk cache). That is substantially faster than the disk transfer rate. If you have a parallel filesystem, you can parallize the hashes as well. -L From hahn at mcmaster.ca Fri Jun 5 15:07:12 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 5 Jun 2009 18:07:12 -0400 (EDT) Subject: [Beowulf] dedupe filesystem In-Reply-To: <20090605143237.GX23524@leitl.org> References: <9f8092cc0906030310j67d09343y3eab84a06c367a65@mail.gmail.com> <200906051429.51888.kilian.cavalotti.work@gmail.com> <20090605143237.GX23524@leitl.org> Message-ID: >> IMO, this is a dubious assertion. I bought a couple incredibly cheap >> desktop disks for home use a couple weeks ago: just seagate 7200.12's. > > Are you happy with the 7200.12 so far? I must admit the awful 7200.11 (the 1 and > 1.5 TByte variety) has quite soured me on Seagate. I haven't had any problems > with WD so far (I'm strictly using RE3 and RE4's, though). 7200.x's are clearly just desktop disks - I wouldn't necessarily build a large raid out of them, or expect to run them 24x7 for years. but these days, you really have to regard disks as cheap consumables, at least if you're not pushing max size or enterprise models. I had myself talked into wd re's as well, but then I realized that a few 7200.12's are just to cheap to care - es.# and re# lines cost over twice as much. and if we're talking about $50 disks, is it really worth talking about? I have no reliability experience yet. it's clear though that disk vendors have heard all the discussions on how URE/NRE rates limit the usability of large raid arrays. (ie, 7200.12's are "1 NRE per 10e14 bits read". I wonder whether they mean just 10^14, since 10e14 should be normalized to 1e15...) in any case, my home array is not going to be close to either 12.5 or 125 TB, so I don't worry about the too-big-to-rebuild-without-URE issue. but I did notice that some drive lines (wd caviar black) actually appear to increase the ECC on bigger models. the 500/640G models are rated as <1 in 10^14, but the 750/1000 step up to <1/10^15. the sustained bw drops from 113 to 106 MB/s, supporting the idea... regards, mark hahn. From landman at scalableinformatics.com Sun Jun 7 21:55:10 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Mon, 08 Jun 2009 00:55:10 -0400 Subject: [Beowulf] dedupe filesystem In-Reply-To: References: <4A295207.5070004@scalableinformatics.com> Message-ID: <4A2C99AE.7030906@scalableinformatics.com> Lawrence Stewart wrote: [...] >> Yup this is it, but on the fly is the hard part. Doing this >> comparison is computationally very expensive. The hash calculations >> are not cheap by any measure. You most decidedly do not wish to do >> this on the fly ... The assumption of a high performance disk/file system is implicit here. >> >>> And for that, it?s all about trading between cost of storage space, >>> retrieval time, and computational effort to run the algorithm. >> >> Exactly. > > > I think the hash calculations are pretty cheap, actually. I just timed > sha1sum on a 2.4 GHz core2 and it runs at 148 Megabytes per second, on > one core (from the disk cache). That is substantially faster than the > disk transfer rate. If you have a parallel filesystem, you can > parallize the hashes as well. Disk transfer rates are 100-120 MB/s these days. 
For high performance local file systems (N * disks), the data rates you showed for sha1 won't cut it. Especially if they have multiple hash signatures computed in order to avoid hash collisions (ala MD5 et al). The probability of two different hashes having two or more unique blocks that have a collision in the same manner is somewhat smaller than the probability that a single hash function has two (or more) unique blocks that have a collision. So you compute two or more in different ways, to reduce the probability of that collision.

For laughs, I just tried this on our lab server. We have a 500 MB/s file system attached to it, so we aren't so concerned about data IO bottlenecks.

First run, not in cache:

landman at crunch:/data/big/isos/centos/centos5$ /usr/bin/time cat CentOS-5.3-x86_64-bin-DVD.iso |sha1sum
0.02user 12.88system 0:47.22elapsed 27%CPU (0avgtext+0avgdata 0maxresident)k
8901280inputs+0outputs (0major+2259minor)pagefaults 0swaps
f8ca12b4acc714f4e4a21f3f35af083952ab46e0 -

Second run, in cache:

landman at crunch:/data/big/isos/centos/centos5$ /usr/bin/time cat CentOS-5.3-x86_64-bin-DVD.iso |sha1sum
0.01user 8.79system 0:37.47elapsed 23%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2259minor)pagefaults 0swaps
f8ca12b4acc714f4e4a21f3f35af083952ab46e0 -

The null version of the hash, just writing to /dev/null:

landman at crunch:/data/big/isos/centos/centos5$ /usr/bin/time cat CentOS-5.3-x86_64-bin-DVD.iso | cat - > /dev/null
0.00user 8.37system 0:10.96elapsed 76%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2258minor)pagefaults 0swaps

So ~11 seconds of the results here are dealing with the pipes and internal unix bits. Removing the pipe:

landman at crunch:/data/big/isos/centos/centos5$ /usr/bin/time cat CentOS-5.3-x86_64-bin-DVD.iso > /dev/null
0.00user 4.95system 0:04.94elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2258minor)pagefaults 0swaps

so it's about 6 seconds cost to use the pipe when data is in cache, and the raw filesystem w/o cache:

landman at crunch:/data/big/isos/centos/centos5$ dd if=CentOS-5.3-x86_64-bin-DVD.iso of=/dev/null iflag=direct bs=32M
135+1 records in
135+1 records out
4557455360 bytes (4.6 GB) copied, 8.72511 s, 522 MB/s

Where I am going with this is that every added layer costs you time, whether or not the data is in cache or on disk. Anything getting in the path of moving data to disk, every pipe, any hash/checksum you compute is going to cost you. The more hashes/checksums you compute, the more time you spend computing, and the less time moving data (for a fixed window of time). Which lowers your effective bandwidth.

Of course, that isn't the only thing you have to worry about. You have to do a hash table lookup as well for dedup. While hash table lookups are pretty efficient, you may need to queue up a hash table insertion if the block wasn't found. And you will need lots of ram to hold these hash tables. Parallel hash table insertion is possible. Parallel lookup is possible. Getting N machines operating on one query should be N times faster if you can partition your table N ways. Get M simultaneous queries going, with table inserts, etc. ... well, maybe not so fast. This is part of the reason that Dedup is used in serial backup pathways, usually with dedicated controller boxes. You can leverage software to do this, but you are going to allocate lots of resources to get the hash table computation, lookup and table management going fast.

My point was that it's hard to do this on the fly in general.
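To make the hash-plus-table bookkeeping concrete, here is a deliberately naive sketch in Python: 4 KB blocks, SHA-1, and a single in-memory dict standing in for the hash table. The block size, hash choice and file-at-a-time scan are just assumptions for illustration; a real dedup engine would byte-compare candidates and keep the table out of core.

# Toy block-level duplicate detector: hash every fixed-size block and count
# how many hashes have already been seen.  Run as: python dedup_sketch.py FILES...
import hashlib
import sys

BLOCK = 4096
seen = {}                 # digest -> (first file, block index)
unique = dupes = 0

for path in sys.argv[1:]:
    with open(path, "rb") as f:
        idx = 0
        while True:
            blk = f.read(BLOCK)
            if not blk:
                break
            d = hashlib.sha1(blk).digest()
            if d in seen:
                dupes += 1        # candidate duplicate; a careful system
                                  # would byte-compare against the stored block
            else:
                seen[d] = (path, idx)
                unique += 1
            idx += 1

print("unique blocks: %d, duplicate blocks: %d" % (unique, dupes))
print("dedup would keep %.1f%% of the blocks" %
      (100.0 * unique / max(unique + dupes, 1)))

The dict here is exactly the RAM cost mentioned above (20 bytes of digest plus overhead per unique 4 KB block), which is why real implementations fuss so much over table layout and partitioning.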
You are right that some aspects can be easily done in parallel. But there are elements that can't be, and the hash computations are still expensive on modern processors. Haven't tried it on the Nehalem yet, but I don't expect much difference.

-- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615

From stewart at serissa.com Mon Jun 8 05:42:23 2009
From: stewart at serissa.com (Lawrence Stewart)
Date: Mon, 8 Jun 2009 08:42:23 -0400
Subject: [Beowulf] dedupe filesystem
In-Reply-To: <4A2C99AE.7030906@scalableinformatics.com>
References: <4A295207.5070004@scalableinformatics.com> <4A2C99AE.7030906@scalableinformatics.com>
Message-ID:

On Jun 8, 2009, at 12:55 AM, Joe Landman wrote:
> Lawrence Stewart wrote:
> [...]
>>> Yup this is it, but on the fly is the hard part. Doing this >>> comparison is computationally very expensive. The hash >>> calculations are not cheap by any measure. You most decidedly do >>> not wish to do this on the fly ...
>
> The assumption of a high performance disk/file system is implicit > here.
>
>>>> And for that, it's all about trading between cost of storage >>>> space, retrieval time, and computational effort to run the >>>> algorithm.
>>>
>>> Exactly.
>
>> I think the hash calculations are pretty cheap, actually. I just >> timed sha1sum on a 2.4 GHz core2 and it runs at 148 Megabytes per >> second, on one core (from the disk cache). That is substantially >> faster than the disk transfer rate. If you have a parallel >> filesystem, you can parallize the hashes as well.
>
> Disk transfer rates are 100-120 MB/s these days. For high > performance local file systems (N * disks), the data rates you > showed for sha1 won't cut it. Especially if they have multiple hash > signatures computed in order to avoid hash collisions (ala MD5 et > al). The probability of two different hashes having two or more > unique blocks that have a collision in the same manner is somewhat > smaller than the probability that a single hash function has two (or > more) unique blocks that have a collision. So you compute two or > more in different ways, to reduce the probability of that collision.
>
> For laughs, I just tried this on our lab server. We have a 500 MB/s > file system attached to it, so we aren't so concerned about data IO > bottlenecks.

Actual measurements and careful analysis beat handwaving every time :-)

My assumptions were a bit different I guess, still figuring 50MB/s per spindle, and supposing that the clients are computing the hashes, rather than the servers running the disks. If that could be arranged, then the cores available to compute hashes scale with the clients. In any case, I am not arguing for deduplication for high performance filesystems; I do think it is a decent idea for backups, though.

Regarding hash collisions, the relevant math is the "birthday problem" I think. If the hash values are uniformly distributed, as they should be, then the probability of a collision rises to about 1/2 when the number of blocks reaches the square root of the size of the value space. So you would have about a 50% chance of a collision if you have 4 billion blocks (32 bits) and are using 64 bit hashes. If multiple hashes are independent, then you get to add the sizes before taking the square root.
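(A quick sanity check of that arithmetic, using the standard birthday-bound approximation P ~= 1 - exp(-n^2 / 2^(b+1)) for n blocks and b-bit hashes; the example block count below, 2^32 blocks or 16 TB of 4K blocks, is just an illustration, not a figure from the thread:)

# Birthday-bound collision probability for n random b-bit hash values.
import math

def p_collision(n_blocks, hash_bits):
    # -expm1(-x) = 1 - exp(-x), and stays accurate for very small probabilities
    return -math.expm1(-(n_blocks ** 2) / float(2 ** (hash_bits + 1)))

n = 2 ** 32                      # ~4 billion blocks = 16 TB of 4K blocks
for bits in (64, 128, 160, 256):
    print("%3d-bit hash, 2^32 blocks: P(collision) ~ %.3g" % (bits, p_collision(n, bits)))

For 64-bit hashes this comes out around 0.4, i.e. the "about 1/2 at the square root of the value space" rule of thumb; for 160- and 256-bit hashes the probability is vanishingly small.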
256 bit hashes ought to give negligible odds of a collision up to 64 bits worth of blocks, where "negligible" means much less than other sources of permanently lost data. However it might make more sense from a system design perspective to use the hash as a hint, and to actually compare the data. This would force a random block read to confirm every duplicate. Hmm, let's convert sequential writes into random reads... The compression ratio of 4K blocks to 16 byte hashes is also suspect, this is 250 to one, and the incremental cost ratio of disk and ram is not much different. So keeping the hashes in ram is probably too expensive. So in summary, deduplication is messy, complicated, has bad performance, and uncertain economics for HPC. Let's not do it. -L From deadline at eadline.org Mon Jun 8 08:54:25 2009 From: deadline at eadline.org (Douglas Eadline) Date: Mon, 8 Jun 2009 11:54:25 -0400 (EDT) Subject: [Beowulf] dedupe filesystem In-Reply-To: <9f8092cc0906050812l83ef801ic7d4568c3e3a153b@mail.gmail.com> References: <9f8092cc0906030310j67d09343y3eab84a06c367a65@mail.gmail.com> <200906051429.51888.kilian.cavalotti.work@gmail.com> <9f8092cc0906050721s2aabe904m4bb1514941bbe9b6@mail.gmail.com> <4A2932F9.1010304@scalableinformatics.com> <9f8092cc0906050812l83ef801ic7d4568c3e3a153b@mail.gmail.com> Message-ID: <51111.192.168.1.213.1244476465.squirrel@mail.eadline.org> treatments, we don't need no stinking treatments And by the way, that is LECCIBG (pronounced "leccibg"). For the curious, http://www.clustermonkey.net/content/view/164/57/ -- Doug > 2009/6/5 Joe Landman : >> > >> Look at it this way ... what is the cost/benefit to the movie-company to >> buy/build expensive storage and build tiers, as compared to much less >> expensve replicated/HSMed storage? ?I think the writing is clearly on >> the >> wall on this. ?Lots of the folks in this industry will disagree, but >> follow >> what the customers are actually doing. >> > There is a science fiction novel which describes how women will live > forever. As women become older, their life expectancy will increase as > new and expensive treatments become available to medical science to > extend their lifetime. At the point where the rate of increase becomes > more than one year added per year, you are effectively immortal. Of > course only women, being wise enough to invest their money in compound > interest bearing schemes will benefit from this. Men, who smoke and > drink at eg. the LECBIG, will die too early for their money to pay for > extended treatments. > > So what we really want is a storage system that will swallow up drives > as they get bigger and bigger - so as your researchers create more and > more data, or stream in more and more satellite/accelerator data/logs > of phone calls (a la GCHQ) then your storage system is expanding at a > faster rate. > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Doug From lindahl at pbm.com Mon Jun 8 13:52:48 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Mon, 8 Jun 2009 13:52:48 -0700 Subject: [Beowulf] dedupe filesystem In-Reply-To: References: <1243964380.30944.8.camel@localhost.localdomain> <4A26781B.1020907@scalableinformatics.com> Message-ID: <20090608205247.GL19555@bx9.net> >> It might be worth noting that dedup is not intended for high >> performance file systems ... 
the cost of computing the hash(es) >> is(are) huge. > > Some file systems do (or claim to do) checksumming for data integrity > purposes, this seems to me like the perfect place to add the computation > of a hash - with data in cache (needed for checksumming anyay), the > computation should be fast. Filesystems may call it a "checksum" but it's usually a hash. We use a Jenkins hash, which is fast and a lot better than, say, the TCP checksum. But it's a lot weaker than an expensive hash. If your dedup is going to fall back to byte-by-byte comparisons, it could be that a weak hash would be good enough. -- greg From brockp at umich.edu Mon Jun 8 14:49:31 2009 From: brockp at umich.edu (Brock Palen) Date: Mon, 8 Jun 2009 17:49:31 -0400 Subject: [Beowulf] Atlas detector from LHC on podcast Message-ID: <6DD2CB87-17B7-4F73-B646-9C27B7B40F9F@umich.edu> Some may find this interesting, I had Shawn Mckee of the ATLAS project on the podcast, you can hear about mini black holes and how they filter 1PB/s down to a manageable amount of data and farm it on grids all over the world. http://www.rce-cast.com/index.php/Podcast/rce-11-um-atlas.html Brock Palen www.umich.edu/~brockp Center for Advanced Computing brockp at umich.edu (734)936-1985 From polk678 at gmail.com Tue Jun 9 01:58:29 2009 From: polk678 at gmail.com (gossips J) Date: Tue, 9 Jun 2009 14:28:29 +0530 Subject: [Beowulf] HPMPI ove uDAPL issue Message-ID: We have been chasing issue with HPMPI running over uDAPL path. Issue: Not able to test Window test of IMB-EXT for 2 processes 2 nodes [1 proc/node]. It is giving below errors: ========================================================= #---------------------------------------------------------------- # Benchmarking Window # #processes = 2 #---------------------------------------------------------------- #bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] 0 100 125.37 125.37 125.37 4 100 161.92 166.28 164.10 8 100 163.65 163.65 163.65 16 100 163.57 163.58 163.57 32 100 163.83 163.83 163.83 64 100 163.24 163.25 163.25 128 100 163.73 163.74 163.74 256 100 163.43 163.44 163.43 512 100 169.98 171.06 170.52 1024 100 168.06 168.38 168.22 2048 100 168.47 169.30 168.88 4096 100 168.48 169.42 168.95 8192 100 169.31 169.50 169.41 16384 100 170.14 174.13 172.13 32768 100 170.12 171.75 170.94 65536 100 176.76 179.81 178.29 IMB-EXT: Rank 0:0: MPI_Win_create: Unable to pin memory for window IMB-EXT: Rank 0:0: MPI_Win_create: Unclassified error IMB-EXT: Rank 0:1: MPI_Win_create: Unable to pin memory for window IMB-EXT: Rank 0:1: MPI_Win_create: Unclassified error MPI Application rank 1 exited before MPI_Finalize() with status 15 # 2:34:34 (HPQ) IN: "HP-MPI" root at mytst (2 licenses) ========================================================= command: mpirun -v -UDAPL -e LD_LIBRARY_PATH=/usr/lib64 -e MPI_FLAGS=D -1sided -f /opt/test_bin/IMB/IMB-EXT Window HPMPI version: hpmpi-2.02.07.00-20080408r.x86_64.rpm Build: OFED-1.4.1 GA Does anyone know what is going wrong here...? Do I miss any environment settings ? Waiting for the answers, Thanks, Polk. -------------- next part -------------- An HTML attachment was scrubbed... URL: From lindahl at pbm.com Thu Jun 11 00:13:12 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Thu, 11 Jun 2009 00:13:12 -0700 Subject: [Beowulf] HPMPI ove uDAPL issue In-Reply-To: References: Message-ID: <20090611071312.GC10741@bx9.net> On Tue, Jun 09, 2009 at 02:28:29PM +0530, gossips J wrote: [...] 
> 32768 100 170.12 171.75 170.94 > 65536 100 176.76 179.81 178.29 > IMB-EXT: Rank 0:0: MPI_Win_create: Unable to pin memory for window > IMB-EXT: Rank 0:0: MPI_Win_create: Unclassified error > IMB-EXT: Rank 0:1: MPI_Win_create: Unable to pin memory for window > IMB-EXT: Rank 0:1: MPI_Win_create: Unclassified error You need to consult the documentation for OFED or perhaps HPMPI about how to increase the amount of locked memory from the default... "ulimit -l" will show you the default. -- greg From cap at nsc.liu.se Thu Jun 11 00:38:53 2009 From: cap at nsc.liu.se (Peter Kjellstrom) Date: Thu, 11 Jun 2009 09:38:53 +0200 Subject: [Beowulf] HPMPI ove uDAPL issue In-Reply-To: References: Message-ID: <200906110939.06180.cap@nsc.liu.se> On Tuesday 09 June 2009, gossips J wrote: > We have been chasing issue with HPMPI running over uDAPL path. > Issue: Not able to test Window test of IMB-EXT for 2 processes 2 nodes [1 > proc/node]. > It is giving below errors: ... > 65536 100 176.76 179.81 178.29 > IMB-EXT: Rank 0:0: MPI_Win_create: Unable to pin memory for window > IMB-EXT: Rank 0:0: MPI_Win_create: Unclassified error ... Most likely you're not allowed to pin enough memory (see ulimit -l). > HPMPI version: > hpmpi-2.02.07.00-20080408r.x86_64.rpm Surely there are more recent HPMPI-versions to go with your bledeing edge OFED? > Build: OFED-1.4.1 GA /Peter -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part. URL: From kus at free.net Thu Jun 11 07:30:48 2009 From: kus at free.net (Mikhail Kuzminsky) Date: Thu, 11 Jun 2009 18:30:48 +0400 Subject: [Beowulf] SuperMicro X8DTi for Nehalem-based nodes Message-ID: Some time ago I asked here about Tyan S7002 motherboards. The situation was changed, and now I have the same question about Supermicro mobos: Is there some "contra-indications" for using of Supermicro X8DTi w/Xeon 5520 for cluster nodes ? May be somebody have some experience w/X8DTi ? Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow From thakur at mcs.anl.gov Fri Jun 12 15:53:10 2009 From: thakur at mcs.anl.gov (Rajeev Thakur) Date: Fri, 12 Jun 2009 17:53:10 -0500 Subject: [Beowulf] MPI + CUDA codes Message-ID: <9F81E066891C4656A542A97A9FE543A6@thakurlaptop> Is anyone aware of codes out there that use a combination of MPI across nodes and CUDA or OpenCL within a node? Rajeev From landman at scalableinformatics.com Fri Jun 12 17:24:06 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Fri, 12 Jun 2009 20:24:06 -0400 Subject: [Beowulf] MPI + CUDA codes In-Reply-To: <9F81E066891C4656A542A97A9FE543A6@thakurlaptop> References: <9F81E066891C4656A542A97A9FE543A6@thakurlaptop> Message-ID: <4A32F1A6.2000903@scalableinformatics.com> Rajeev Thakur wrote: > Is anyone aware of codes out there that use a combination of MPI across > nodes and CUDA or OpenCL within a node? Hi Rajeev: We were doing this for a customer about a month ago. MPI between nodes, CUDA within, with F90 and C. I am not sure the customer is still working on this, we did get it started for them. 
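In case a structural sketch helps: the usual shape is one MPI rank per GPU, with the CUDA work inside each rank and MPI used only across ranks/nodes. Below is a minimal, illustrative example using mpi4py and PyCUDA, chosen here only for brevity -- it is not the F90/C code mentioned above, and it assumes a working CUDA toolkit and at least one visible GPU per rank.

# MPI across nodes, CUDA within the node: each rank grabs a GPU, runs a
# trivial kernel on its local chunk, then an MPI allreduce combines results.
import numpy as np
from mpi4py import MPI
import pycuda.driver as drv
from pycuda.compiler import SourceModule

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

drv.init()
ctx = drv.Device(rank % drv.Device.count()).make_context()   # one GPU per rank
try:
    mod = SourceModule("""
    __global__ void scale(float *a, float s, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] *= s;
    }
    """)
    scale = mod.get_function("scale")

    n = 1 << 20
    a = np.ones(n, dtype=np.float32) * (rank + 1)    # this rank's local chunk
    scale(drv.InOut(a), np.float32(2.0), np.int32(n),
          block=(256, 1, 1), grid=((n + 255) // 256, 1))

    local_sum = float(a.sum())
    total = comm.allreduce(local_sum, op=MPI.SUM)     # MPI across nodes
    if rank == 0:
        print("global sum: %g" % total)
finally:
    ctx.pop()

Launched the ordinary way, e.g. mpirun -np <ranks> python the_script.py, with the scheduler placing ranks so each one lands next to a GPU.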
Joe

-- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615

From m.janssens at opencfd.co.uk Sun Jun 14 10:01:45 2009
From: m.janssens at opencfd.co.uk (Mattijs Janssens)
Date: Sun, 14 Jun 2009 18:01:45 +0100
Subject: [Beowulf] MPI + CUDA codes
In-Reply-To: <9F81E066891C4656A542A97A9FE543A6@thakurlaptop>
References: <9F81E066891C4656A542A97A9FE543A6@thakurlaptop>
Message-ID: <200906141801.45351.m.janssens@opencfd.co.uk>

On Friday 12 June 2009 23:53, Rajeev Thakur wrote:
> Is anyone aware of codes out there that use a combination of MPI across > nodes and CUDA or OpenCL within a node?

From what I remember FEAST uses both: http://www.feast.uni-dortmund.de/publications.html

Mattijs

From polk678 at gmail.com Fri Jun 12 00:06:03 2009
From: polk678 at gmail.com (gossips J)
Date: Fri, 12 Jun 2009 12:36:03 +0530
Subject: [Beowulf] HPMPI ove uDAPL issue
In-Reply-To: <200906110939.06180.cap@nsc.liu.se>
References: <200906110939.06180.cap@nsc.liu.se>
Message-ID:

I updated my /etc/security/limits.conf with hard/soft "unlimited". Still the same issue. I am using the ofed-1.4.1-ga build with this hpmpi rpm. I also tried the physical_mem env settings and pin_percentage as well; that did not help either.

Also observed that this error happens during memory registration of 256K buffers with the "Window" test case, and during the error the system has sufficient free pages available. Basically the error returns from the CORE (umem.c) file with error -EFAULT in the get_user_pages() API. But after running "Accumulate" it is observed that it is able to register up to 4194304 length easily.

Thanks,
-polk

On Thu, Jun 11, 2009 at 1:08 PM, Peter Kjellstrom wrote:
> On Tuesday 09 June 2009, gossips J wrote:
> > We have been chasing issue with HPMPI running over uDAPL path.
> > Issue: Not able to test Window test of IMB-EXT for 2 processes 2 nodes [1 > > proc/node].
> > It is giving below errors:
> ...
> > 65536 100 176.76 179.81 178.29
> > IMB-EXT: Rank 0:0: MPI_Win_create: Unable to pin memory for window
> > IMB-EXT: Rank 0:0: MPI_Win_create: Unclassified error
> ...
>
> Most likely you're not allowed to pin enough memory (see ulimit -l).
>
> > HPMPI version:
> > hpmpi-2.02.07.00-20080408r.x86_64.rpm
>
> Surely there are more recent HPMPI-versions to go with your bledeing edge > OFED?
>
> > Build: OFED-1.4.1 GA
>
> /Peter
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From brockp at umich.edu Fri Jun 12 16:54:19 2009
From: brockp at umich.edu (Brock Palen)
Date: Fri, 12 Jun 2009 19:54:19 -0400
Subject: [Beowulf] MPI + CUDA codes
In-Reply-To: <9F81E066891C4656A542A97A9FE543A6@thakurlaptop>
References: <9F81E066891C4656A542A97A9FE543A6@thakurlaptop>
Message-ID: <9D0E7F59-4D7D-40C5-B104-5463260DB824@umich.edu>

I think the NAMD folks had a paper and data from real running code at SC last year. Check with them.
Brock Palen www.umich.edu/~brockp Center for Advanced Computing brockp at umich.edu (734)936-1985 On Jun 12, 2009, at 6:53 PM, Rajeev Thakur wrote: > Is anyone aware of codes out there that use a combination of MPI > across > nodes and CUDA or OpenCL within a node? > > Rajeev > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > From charliep at cs.earlham.edu Mon Jun 15 04:09:23 2009 From: charliep at cs.earlham.edu (Charlie Peck) Date: Mon, 15 Jun 2009 07:09:23 -0400 Subject: [Beowulf] MPI + CUDA codes In-Reply-To: <9D0E7F59-4D7D-40C5-B104-5463260DB824@umich.edu> References: <9F81E066891C4656A542A97A9FE543A6@thakurlaptop> <9D0E7F59-4D7D-40C5-B104-5463260DB824@umich.edu> Message-ID: <0C87BBD6-0902-4BB4-9A70-2CC0446390E9@cs.earlham.edu> On Jun 12, 2009, at 7:54 PM, Brock Palen wrote: > I think the Namd folks had a paper and data from real running code > at SC last year. Check with them. Their paper from SC08 is here: http://mc.stanford.edu/cgi-bin/images/8/8a/SC08_NAMD.pdf charlie From eugen at leitl.org Mon Jun 15 08:07:43 2009 From: eugen at leitl.org (Eugen Leitl) Date: Mon, 15 Jun 2009 17:07:43 +0200 Subject: [Beowulf] Data Center Overload Message-ID: <20090615150743.GZ23524@leitl.org> http://www.nytimes.com/2009/06/14/magazine/14search-t.html?_r=1&ref=magazine&pagewanted=print Data Center Overload By TOM VANDERBILT It began with an Xbox game. On a recent rainy evening in Brooklyn, I was at a friend?s house playing (a bit sheepishly, given my incipient middle age) Call of Duty: World at War. Scrolling through the game?s menus, I noticed a screen for Xbox Live, which allows you to play against remote users via broadband. The number of Call of Duty players online at that moment? More than 66,000. Walking home, I ruminated on the number. Sixty-six thousand is the population of a small city ? Muncie, Ind., for one. Who and where was this invisible metropolis? What infrastructure was needed to create this city of ether? We have an almost inimical incuriosity when it comes to infrastructure. It tends to feature in our thoughts only when it?s not working. The Google search results that are returned in 0.15 seconds were once a stirring novelty but soon became just another assumption in our lives, like the air we breathe. Yet whose day would proceed smoothly without the computing infrastructure that increasingly makes it possible to navigate the world and our relationships within it? Much of the daily material of our lives is now dematerialized and outsourced to a far-flung, unseen network. The stack of letters becomes the e-mail database on the computer, which gives way to Hotmail or Gmail. The clipping sent to a friend becomes the attached PDF file, which becomes a set of shared bookmarks, hosted offsite. The photos in a box are replaced by JPEGs on a hard drive, then a hosted sharing service like Snapfish. The tilting CD tower gives way to the MP3-laden hard drive which itself yields to a service like Pandora, music that is always ?there,? waiting to be heard. But where is ?there,? and what does it look like? ?There? is nowadays likely to be increasingly large, powerful, energy-intensive, always-on and essentially out-of-sight data centers. These centers run enormously scaled software applications with millions of users. 
To appreciate the scope of this phenomenon, and its crushing demands on storage capacity, let me sketch just the iceberg?s tip of one average individual digital presence: my own. I have photos on Flickr (which is owned by Yahoo, so they reside in a Yahoo data center, probably the one in Wenatchee, Wash.); the Wikipedia entry about me dwells on a database in Tampa, Fla.; the video on YouTube of a talk I delivered at Google?s headquarters might dwell in any one of Google?s data centers, from The Dalles in Oregon to Lenoir, N.C.; my LinkedIn profile most likely sits in an Equinix-run data center in Elk Grove Village, Ill.; and my blog lives at Modwest?s headquarters in Missoula, Mont. If one of these sites happened to be down, I might have Twittered a complaint, my tweet paying a virtual visit to (most likely) NTT America?s data center in Sterling, Va. And in each of these cases, there would be at least one mirror data center somewhere else ? the built-environment equivalent of an external hard drive, backing things up. Small wonder that this vast, dispersed network of interdependent data systems has lately come to be referred to by an appropriately atmospheric ? and vaporous ? metaphor: the cloud. Trying to chart the cloud?s geography can be daunting, a task that is further complicated by security concerns. ?It?s like ?Fight Club,? ? says Rich Miller, whose Web site, Data Center Knowledge, tracks the industry. ?The first rule of data centers is: Don?t talk about data centers.? Yet as data centers increasingly become the nerve centers of business and society ? even the storehouses of our fleeting cultural memory (that dancing cockatoo on YouTube!) ? the demand for bigger and better ones increases: there is a growing need to produce the most computing power per square foot at the lowest possible cost in energy and resources. All of which is bringing a new level of attention, and challenges, to a once rather hidden phenomenon. Call it the architecture of search: the tens of thousands of square feet of machinery, humming away 24/7, 365 days a year ? often built on, say, a former bean field ? that lie behind your Internet queries. INSIDE THE CLOUD Microsoft?s data center in Tukwila, Wash., sits amid a nondescript sprawl of beige boxlike buildings. As I pulled up to it in a Prius with Michael Manos, who was then Microsoft?s general manager of data-center services, he observed that while ?most people wouldn?t be able to tell this wasn?t just a giant warehouse,? an experienced eye could discern revelatory details. ?You would notice the plethora of cameras,? he said. ?You could follow the power lines.? He gestured to a series of fluted silver pipes along one wall. ?Those are chimney stacks, which probably tells you there?s generators behind each of those stacks.? The generators, like the huge banks of U.P.S. (uninterruptible power supply) batteries, ward against surges and power failures to ensure that the data center always runs smoothly. After submitting to biometric hand scans in the lobby and passing through a sensor-laden multidoor man trap, Manos and I entered a bright, white room filled with librarylike rows of hulking, black racks of servers ? the dedicated hardware that drives the Internet. The Tukwila data center happens to be one of the global homes of Microsoft?s Xbox Live: within those humming machines exists my imagined city of ether. 
Like most data centers, Tukwila comprises a sprawling array of servers, load balancers, routers, fire walls, tape-backup libraries and database machines, all resting on a raised floor of removable white tiles, beneath which run neatly arrayed bundles of power cabling. To help keep servers cool, Tukwila, like most data centers, has a system of what are known as hot and cold aisles: cold air that seeps from perforated tiles in front is sucked through the servers by fans, expelled into the space between the backs of the racks and then ventilated from above. The collective din suggests what it must be like to stick your head in a Dyson Airblade hand dryer. Tukwila is less a building than a machine for computing. ?You look at a typical building,? Manos explained, ?and the mechanical and electrical infrastructure is probably below 10 percent of the upfront costs. Whereas here it?s 82 percent of the costs.? Little thought is given to exterior appearances; even the word ?architecture? in the context of a data center can be confusing: it could refer to the building, the network or the software running on the servers. Chris Crosby, a senior vice president with Digital Realty Trust, the country?s largest provider of data-center space, compares his company?s product to a car, an assembly-line creation complete with model numbers: ?The model number tells you how much power is available inside the facility.? He also talks about the ?industrialization of the data center,? in contrast to the so-called whiteboard model of server design, by which each new building might be drawn up from scratch. The data center, he says, is ?our railroad; it doesn?t matter what kind of train you put on it.? At Tukwila ? as at any big data center ? the computing machinery is supported by what Manos calls the ?back-of-the-house stuff?: the chiller towers, the miles of battery springs, the intricate networks of piping. There?s also what Manos calls ?the big iron,? the 2.5-megawatt, diesel-powered Caterpillar generators clustered at one end of a cavernous space known as the wind tunnel, through which air rushes to cool the generators. ?In reality, the cloud is giant buildings full of computers and diesel generators,? Manos says. ?There?s not really anything white or fluffy about it.? Tukwila is one of Microsoft?s smaller data centers (they number ?more than 10 and fewer than 100,? Manos told me with deliberate vagueness). In 2006, the company, lured by cheap hydropower, tax incentives and a good fiber-optic network, built a 500,000-plus-square-foot data center in Quincy, Wash., a small town three hours from Tukwila known for its bean and spearmint fields. This summer, Microsoft will open a 700,000-plus-square-foot data center ? one of the world?s largest ? in Chicago. ?We are about three to four times larger than when I joined the company? ? in 2004 ? ?just in terms of data-center footprint,? Debra Chrapaty, corporate vice president of Global Foundation Services at Microsoft, told me when I met with her at Microsoft?s offices in Redmond, Wash. Yet when it comes to a large company like Microsoft, it can be difficult to find out what any given data center is used for. The company, for reasons ranging from security to competitive advantage, won?t provide too much in the way of details, apart from noting that Quincy could hold 6.75 trillion photos. ?We support over 200 online properties with very large scale,? Chrapaty offered. 
"And so when you think about Hotmail supporting 375 million users, or search supporting three billion queries a month, or Messenger supporting hundreds of millions of users, you can easily assume that those properties are very large properties for our company." Thanks to the efforts of amateur Internet Kremlinologists, there are occasional glimpses behind the silicon curtain. One blogger managed to copy a frame from a 2008 video of a Microsoft executive's PowerPoint presentation showing that the company had nearly 150,000 servers (a number that presumably would now be much higher, given an estimated monthly server growth of 10,000) and that nearly 80,000 of those were used by its search application, now called Bing. When I discussed the figures with her, Chrapaty would only aver, crisply, that "in an average data center, it's not uncommon for search to take up a big portion of a facility."

THE RISE OF THE MEGA-DATA CENTER

Data centers were not always unmarked, unassuming and highly restricted places. In the 1960s, in fact, huge I.B.M. mainframe computers commanded pride of place in corporate headquarters. "It was called the glasshouse," says Kenneth Brill, founder of the Uptime Institute. [...] "... access to the same application as a customer that has 65,000 seats, like Starbucks or Dell," Adam Gross, vice president of platform marketing with salesforce.com, told me at the company's offices in San Francisco. By contrast, just a few years ago, he went on, "if you were to attack a really large problem, like delivering a C.R.M. application to 50,000 companies, or serving every single song ever, it really sort of felt outside your domain unless you were one of the largest companies in the world. There are these architectures now available for anybody to really attack these massive-scale kinds of problems."

And while most companies still maintain their own data centers, the promise is that instead of making costly investments in redundant I.T. hardware, more and more companies will tap into the utility-computing grid, piggybacking on the infrastructures of others. Already, Amazon Web Services makes available, for a fee, the company's enormous computing power to outside customers. The division already uses more bandwidth than Amazon's extensive retailing operations, while its Simple Storage Service holds some 52 billion virtual objects. "We used to think that owning factories was an important piece of a business's value," says Bryan Doerr, the chief technology officer of Savvis, which provides I.T. infrastructure and what the company calls "virtualized utility services" for companies like Hallmark. "Then we realized that owning what the factory produces is more important."

THE ANNIHILATION OF SPACE BY TIME

For companies like Google, Yahoo and, increasingly, Microsoft, the data center is the factory. What these companies produce are services. It was the increasing "viability of a service-based model," as Ray Ozzie, now the chief software architect at Microsoft, put it in 2005 -- portended primarily by Google and its own large-scale network of data centers -- that set Microsoft on its huge data-center rollout: if people no longer needed desktop software, they would no longer need Microsoft. This realization brought new prominence to the humble infrastructure layer of the data center, an aspect of the business that at Microsoft, as at most tech companies, typically escaped notice -- unless it wasn't working. Data centers have now become, as Debra Chrapaty of Microsoft puts it, a "true differentiator."
Indeed, the number of servers in the United States nearly quintupled from 1997 to 2007. (Kenneth Brill of the Uptime Institute notes that the mega-data centers of Google and its ilk account for only an estimated 5 percent of the total market.) The expansion of Internet-driven business models, along with the data retention and compliance requirements of a variety of tighter accounting standards and other financial regulations, has fueled a tremendous appetite for data-center space. For a striking example of how our everyday clicks and uploads help drive and shape this real world real estate, consider Facebook. Facebook?s numbers are staggering. More than 200 million users have uploaded more than 15 billion photos, making Facebook the world?s largest photo-sharing service. This expansion has required a corresponding infrastructure push, with an energetic search for financing. ?We literally spend all our time figuring how to keep up with the growth,? Jonathan Heiliger, Facebook?s vice president of technical operations, told me in a company conference room in Palo Alto, Calif. ?We basically buy space and power.? Facebook, he says, is too large to rent space in a managed ?co-location facility,? yet not large enough to build its own data centers. ?Five years ago, Facebook was a couple of servers under Mark?s desk in his dorm room,? Heiliger explained, referring to Mark Zuckerberg, Facebook?s founder. ?Then it moved to two sorts of hosting facilities; then it graduated to this next category, taking a data center from an R.E.I.T.? ? real estate investment trust ? ?in the Bay Area and then basically continued to expand that. We now have a fleet of data centers.? A big challenge for Facebook, or any Internet site with millions of users, is ?scalability? ? ensuring that the infrastructure will keep working as new applications or users are added (often in incredibly spiky fashion, as when Oprah Winfrey joined and immediately garnered some 170,000 friends). Another issue is determining where Facebook?s data centers are located, where its users are located and the distance between them ? what is called latency. Though the average user might not appreciate it, a visit to Facebook may involve dozens of round trips between a browser and any number of the site?s servers. In 2007, Facebook opened a third data center in Virginia to expand its capacity and serve its increasing number of users in Europe and elsewhere. ?If you?re in the middle of the country, the differences are pretty minor whether you go to California or Virginia,? Heiliger said. But extend your communications to, say, India, and delay begins to compound. Bits, limited by the laws of physics, can travel no faster than the speed of light. To hurry things up, Facebook can try to reduce the number of round trips, or to ?push the data as close to a user as possible? (by creating new data centers), or to rely on content-data networks that store commonly retrieved data in Internet points of presence (POPs) around the world. While an anxious Facebook user serially refreshing to see if a friend has replied to an invitation might seem the very picture of the digital age?s hunger for instantaneity, to witness a true imperative for speed, you must visit NJ2, a data center located in Weehawken, N.J., just through the Lincoln Tunnel from Manhattan. 
There, in an unmarked beige complex with smoked windows, hum the trading engines of several large financial exchanges including, until recently, the Philadelphia Stock Exchange (it was absorbed last year by Nasdaq). NJ2, owned by Digital Realty Trust, is managed by Savvis, which provides "proximity hosting" -- enabling financial companies to be close to the market. At first I took this to mean proximity to Wall Street, but I soon learned that it meant proximity of the financial firms' machines to the machines of the trading exchanges in NJ2. This is desirable because of the rise of electronic exchanges, in which machine-powered models are, essentially, competing against other machine-powered models. And the temporal window for such trading, which is projected this year by Celent to account for some 46 percent of all U.S. trading volume, is growing increasingly small. "It used to be that things were done in seconds, then milliseconds," Varghese Thomas, Savvis's vice president of financial markets, told me. Intervening steps -- going through a consolidated ticker vendor like Thomson Reuters -- added 150 to 500 milliseconds to the time it takes for information to be exchanged. "These firms said, 'I can eliminate that latency much further by connecting to the exchanges directly,'" Thomas explained. Firms initially linked from their own centers, but that added precious fractions of milliseconds. So they moved into the data center itself. "If you're in the facility, you're eliminating that wire."

The specter of infinitesimal delay is why, when the Philadelphia Stock Exchange, the nation's oldest, upgraded its trading platform in 2006, it decided to locate the bulk of its trading engines 80 miles -- and three milliseconds -- from Philadelphia, and into NJ2, where, as Thomas notes, the time to communicate between servers is down to a millionth of a second. (Latency concerns are not limited to Wall Street; it is estimated that a 100-millisecond delay reduces Amazon's sales by 1 percent.) At NJ2, a room hosting one of the exchanges (I agreed not to say which, for security reasons), housed, in typical data-center fashion, rows of loudly humming black boxes, whose activity was literally inscrutable. This seemed strangely appropriate; after all, as Thomas pointed out, the data center hosts a number of "dark pools," or trading regimens that allow the anonymous buying and selling of small amounts of securities at a time, so as not, as Thomas puts it, "to create ripples in the market." It seemed heretical to think of Karl Marx. But looking at the roomful of computers running automated trading models that themselves scan custom-formatted machine-readable financial news stories to help make decisions, you didn't have to be a Marxist to appreciate his observation that industry will strive to "produce machines by means of machines" -- as well as his prediction that the "more developed the capital," the more it would seek the "annihilation of space by time."

THE COST OF THE CLOUD

Data centers worldwide now consume more energy annually than Sweden. And the amount of energy required is growing, says Jonathan Koomey, a scientist at Lawrence Berkeley National Laboratory. From 2000 to 2005, the aggregate electricity use by data centers doubled. The cloud, he calculates, consumes 1 to 2 percent of the world's electricity. Much of this is due simply to growth in the number of servers and the Internet itself. A Google search is not without environmental consequence -- 0.2 grams of CO2 per search, the company claims --
but based on E.P.A. assumptions, an average car trip to the library consumes some 4,500 times the energy of a Google search while a page of newsprint uses some 350 times more energy. Data centers, however, are loaded with inefficiencies, including loss of power as it is distributed through the system. It has historically taken nearly as much wattage to cool the servers as it does to run them. Many servers are simply "comatose." "Ten to 30 percent of servers are just sitting there doing nothing," Koomey says. "Somebody in some department had a server doing this unique thing for a while and then stopped using it." Because of the complexity of the network architecture -- in which the role of any one server might not be clear or may have simply been forgotten -- turning off a server can create more problems (e.g., service outages) than simply leaving it on. As servers become more powerful, more kilowatts are needed to run and cool them; square footage in data centers is eaten up not by servers but by power.

As data centers grow to unprecedented scales -- Google recently reported that one of its data centers holds more than 45,000 servers (only a handful of companies have that many total servers) -- attention has shifted to making servers less energy intensive. One approach is to improve the flow of air in the data center, through computational fluid-dynamics modeling. "Each of these servers could take input air at about 80 degrees," John Sontag, director of the technology transfer office at Hewlett-Packard, told me as we walked through the company's research lab in Palo Alto. "The reason why you run it at 57 is you're not actually sure you can deliver cold air" everywhere it is needed. Chandrakant Patel, director of the Sustainable I.T. Ecosystem Lab at H.P., argues there has been "gross overprovisioning" of cooling in data centers. "Why should all the air-conditioners run full time in the data center?" he asks. "They should be turned down based on the need."

Power looms larger than space in the data center's future -- the data-center group Afcom predicts that in the next five years, more than 90 percent of companies' data centers will be interrupted at least once because of power constrictions. As James Hamilton of Amazon Web Services observed recently at a Google-hosted [...] that are very hard to realize in a standard rack-mount environment." The containers -- which are pre-equipped with racks of servers and thus are essentially what is known in the trade as plug-and-play -- are shipped by truck direct from the original equipment manufacturer and attached to a central spine. "You can literally walk into that building on the first floor and you'd be hard pressed to tell that building apart from a truck-logistics depot," says Manos, who has since left Microsoft to join Digital Realty Trust. "Once the containers get on site, we plug in power, water, network connectivity, and the boxes inside wake up, figure out which property group they belong to and start imaging themselves. There's very little need for people." "Our perspective long term is: It's not a building, it's a piece of equipment," says Daniel Costello, Microsoft's director of data-center research, "and the enclosure is not there to protect human occupancy; it's there to protect the equipment." From here, it is easy to imagine gradually doing away with the building itself, and its cooling requirements, which is, in part, what Microsoft is doing next, with its Gen 4 data center in Dublin.
One section of the facility consists of a series of containers, essentially parked and stacked amid other modular equipment -- with no roof or walls. It will use outside air for cooling. On our drive to Tukwila, Manos gestured to an electrical substation, a collection of transformers grouped behind a chain-link fence. "We're at the beginning of the information utility," he said. "The past is big monolithic buildings. The future looks more like a substation -- the data center represents the information substation of tomorrow." Tom Vanderbilt is the author of "Traffic: Why We Drive the Way We Do (and What It Says About Us)." From gerry.creager at tamu.edu Mon Jun 15 10:45:34 2009 From: gerry.creager at tamu.edu (Gerry Creager) Date: Mon, 15 Jun 2009 12:45:34 -0500 Subject: [Beowulf] MPI + CUDA codes In-Reply-To: <0C87BBD6-0902-4BB4-9A70-2CC0446390E9@cs.earlham.edu> References: <9F81E066891C4656A542A97A9FE543A6@thakurlaptop> <9D0E7F59-4D7D-40C5-B104-5463260DB824@umich.edu> <0C87BBD6-0902-4BB4-9A70-2CC0446390E9@cs.earlham.edu> Message-ID: <4A3688BE.7050501@tamu.edu> Charlie Peck wrote: > On Jun 12, 2009, at 7:54 PM, Brock Palen wrote: > >> I think the Namd folks had a paper and data from real running code at >> SC last year. Check with them. > > Their paper from SC08 is here: > > http://mc.stanford.edu/cgi-bin/images/8/8a/SC08_NAMD.pdf Michalakes et al: http://www.google.com/url?sa=t&source=web&ct=res&cd=3&url=http%3A%2F%2Fwww.mmm.ucar.edu%2Fwrf%2FWG2%2Fmichalakes_lspp.pdf&ei=DYg2StDxJ5HSMPv0lYoK&usg=AFQjCNGI-ajaVr3gHPbKBGomgX_SvkaQbg and benchmarks at http://www.google.com/url?sa=t&source=web&ct=res&cd=1&url=http%3A%2F%2Finsidehpc.com%2F2008%2F09%2F02%2Fcuda-physics-in-wrf%2F&ei=DYg2StDxJ5HSMPv0lYoK&usg=AFQjCNEEZXh544XlFKUQLxOV8lpDkdZx9w gerry -- Gerry Creager -- gerry.creager at tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From hearnsj at googlemail.com Mon Jun 15 10:59:37 2009 From: hearnsj at googlemail.com (John Hearns) Date: Mon, 15 Jun 2009 18:59:37 +0100 Subject: [Beowulf] HPC fault tolerance using virtualization Message-ID: <9f8092cc0906151059x6f38f1f3r28f78fde6b09085@mail.gmail.com> I was doing a search on ganglia + ipmi (I'm looking at doing such a thing for temperature measurement) when I came across this paper: http://www.csm.ornl.gov/~engelman/publications/nagarajan07proactive.ppt.pdf Proactive Fault Tolerance for HPC using Xen virtualization Its something I've wanted to see working - doing a Xen live migration of a 'dodgy' compute node, and the job just keeps on trucking. Looks as if these guys have it working. Anyone else seen similar? John Hearns From mdidomenico4 at gmail.com Mon Jun 15 11:47:40 2009 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Mon, 15 Jun 2009 14:47:40 -0400 Subject: [Beowulf] HPC fault tolerance using virtualization In-Reply-To: <9f8092cc0906151059x6f38f1f3r28f78fde6b09085@mail.gmail.com> References: <9f8092cc0906151059x6f38f1f3r28f78fde6b09085@mail.gmail.com> Message-ID: On Mon, Jun 15, 2009 at 1:59 PM, John Hearns wrote: > Proactive Fault Tolerance for HPC using Xen virtualization > > Its something I've wanted to see working - doing a Xen live migration > of a 'dodgy' compute node, and the job just keeps on trucking. > Looks as if these guys have it working. Anyone else seen similar? I haven't seen it in the field yet, but I had hoped to do something similar with a cluster this summer.
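For anyone who wants to poke at the idea, the migration step itself is a one-liner under Xen's xm toolstack -- a sketch only, with made-up domain and host names, and it assumes the guest's disk is on shared storage and that xend relocation is enabled on the destination:

    # evacuate guest "compute-vm03" from a suspect host to a healthy one,
    # keeping it running throughout (names here are examples, not from the paper)
    xm migrate --live compute-vm03 node37

The hard part, as the discussion below shows, is deciding when to trigger it.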
I hadn't seen the above paper before, but I was basing my test on some papers I'd seen about using Xen with cloud computing initiatives (ala AWS or Eucalyptus). Ideally I'd like to see Infiniband worked into the mix, so that I could use high speed messaging within the xen images and then live migrate an image as need arises. DK Panda has a paper that shows a little bit of this, but details are few and far between. It would be nice to be able to just move bad hardware out from under a running job without affecting the run of the job. If it took an extra ten minutes for the job to run because of a migration i think thats a small price to pay for actually having the run go to completion and not have to worry so much about checkpoints. Course having said all that, if you've been watching the linux-kernel mailing list you've probably noticed the Xen/Kvm/Linux HV argument that took place last week. Makes me a little afraid to push any Linux HV solution into to production, but it's a fun experiment none the less... From hearnsj at googlemail.com Mon Jun 15 12:58:58 2009 From: hearnsj at googlemail.com (John Hearns) Date: Mon, 15 Jun 2009 20:58:58 +0100 Subject: [Beowulf] HPC fault tolerance using virtualization In-Reply-To: References: <9f8092cc0906151059x6f38f1f3r28f78fde6b09085@mail.gmail.com> Message-ID: <9f8092cc0906151258t2d992c69teef9eb2efe3a680f@mail.gmail.com> 2009/6/15 Michael Di Domenico : > > > Course having said all that, if you've been watching the linux-kernel > mailing list you've probably noticed the Xen/Kvm/Linux HV argument > that took place last week. Makes me a little afraid to push any Linux > HV solution into to production, but it's a fun experiment none the > less... Could you please give us a summary? Really would be appreciated - I don't follow those lists that closely. I spent the last week in a small room with lots of racks, cables and blinking lights! From kilian.cavalotti.work at gmail.com Tue Jun 16 01:38:55 2009 From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI) Date: Tue, 16 Jun 2009 10:38:55 +0200 Subject: [Beowulf] HPC fault tolerance using virtualization In-Reply-To: References: <9f8092cc0906151059x6f38f1f3r28f78fde6b09085@mail.gmail.com> Message-ID: <200906161038.56577.kilian.cavalotti.work@gmail.com> On Monday 15 June 2009 20:47:40 Michael Di Domenico wrote: > It would be nice to be able to just move bad hardware out from under a > running job without affecting the run of the job. I may be missing something major here, but if there's bad hardware, chances are the job has already failed from it, right? Would it be a bad disk (and the OS would only notice a bad disk while trying to write on it, likely asked to do so by the job), or bad memory, or bad CPU, or faulty PSU. Anything hardware losing bits mainly manifests itself in software errors. There is very little chance to spot a bad DIMM until something (like a job) tries to write to it. So unless there's a way to detect faulty hardware before it affects anything software, it's very likely that the job would have crashed already, before the OS could pull out its migration toolkit. The paper John mentioned is centered around IPMI for preventive fault detection. It probably works for some cases (where you can use thresholds, like temperature probes or fan speeds), where IPMI detects hardware errors before it affects the running job. But from what I've seen most often, it's kind of too late, and IPMI logs an error when the job has crashed already.
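For concreteness, the threshold-style polling being discussed is nothing exotic -- roughly the following (a sketch; the sensor name is made up and will differ on your BMC):

    # read every sensor, with its thresholds, from the BMC
    ipmitool sensor list
    # read a single sensor by name (the name is an example; yours will differ)
    ipmitool sensor get "Ambient Temp"
    # the system event log is where ECC, fan and temperature events end up
    ipmitool sel elist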
And even if it didn't crash yet, what kind of assurance do you have that the result of simulation has not been corrupted in some way by that faulty DIMM you got? My take on this is that it's probably more efficient to develop checkpointing features and recovery in software (like MPI) rather than adding a virtualization layer, which is likely to decrease performance. Cheers, -- Kilian From hearnsj at googlemail.com Tue Jun 16 02:02:11 2009 From: hearnsj at googlemail.com (John Hearns) Date: Tue, 16 Jun 2009 10:02:11 +0100 Subject: [Beowulf] HPC fault tolerance using virtualization In-Reply-To: <200906161038.56577.kilian.cavalotti.work@gmail.com> References: <9f8092cc0906151059x6f38f1f3r28f78fde6b09085@mail.gmail.com> <200906161038.56577.kilian.cavalotti.work@gmail.com> Message-ID: <9f8092cc0906160202m274ad417pa9f3da905c17799d@mail.gmail.com> 2009/6/16 Kilian CAVALOTTI > > > I may be missing something major here, but if there's bad hardware, chances > are the job has already failed from it, right? Would it be a bad disk (and > the > OS would only notice a bad disk while trying to write on it, likely asked > to > do so by the job), or bad memory, or bad CPU, or faulty PSU. Anything > hardware > losing bits mainly manifests itself in software errors. There is very > little > chance to spot a bad DIMM until something (like a job) tries to write to > it. What you say is very true. However, you could look at correctable ECC errors, and for disks run a smartctl test and see if a disk is showing symptoms which might make it fail in future. Or maybe look at the error rates on your ethernet or infiniband interface - you might want to take that node out till it can be investigated (read: reseating the cable!) -------------- next part -------------- An HTML attachment was scrubbed... URL: From hearnsj at googlemail.com Tue Jun 16 02:05:07 2009 From: hearnsj at googlemail.com (John Hearns) Date: Tue, 16 Jun 2009 10:05:07 +0100 Subject: [Beowulf] HPC fault tolerance using virtualization In-Reply-To: <200906161038.56577.kilian.cavalotti.work@gmail.com> References: <9f8092cc0906151059x6f38f1f3r28f78fde6b09085@mail.gmail.com> <200906161038.56577.kilian.cavalotti.work@gmail.com> Message-ID: <9f8092cc0906160205t18663ec0l59840058566ca565@mail.gmail.com> 2009/6/16 Kilian CAVALOTTI > My take on this is that it's probably more efficient to develop > checkpointing > features and recovery in software (like MPI) rather than adding a > virtualization layer, which is likely to decrease performance. > The performance hits measured by Panda et al. on Infiniband connected hardware are of the order of 5 percent (I may be wrong here). I believe that if we can get features like live migration of failing machines, plus specialized stripped-down virtual machines specific to job types then we will see virtualization becoming mainstream in HPC clustering. -------------- next part -------------- An HTML attachment was scrubbed... URL: From d.love at liverpool.ac.uk Tue Jun 16 03:01:42 2009 From: d.love at liverpool.ac.uk (Dave Love) Date: Tue, 16 Jun 2009 11:01:42 +0100 Subject: [Beowulf] Re: HPC fault tolerance using virtualization References: <9f8092cc0906151059x6f38f1f3r28f78fde6b09085@mail.gmail.com> Message-ID: <87vdmwv6nt.fsf@liv.ac.uk> John Hearns writes: > I was doing a search on ganglia + ipmi (I'm looking at doing such a > thing for temperature measurement) Like ? If you want to take action, though, go direct to Nagios or similar with sensor readings, chassis health data, etc.
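Along the lines John suggests, a minimal in-band probe is only a few commands (a sketch; the sysfs path and disk device name are assumptions that vary by kernel, driver and hardware):

    #!/bin/sh
    # correctable ECC counts as seen by the EDAC driver (path varies by kernel)
    grep . /sys/devices/system/edac/mc/mc*/ce_count 2>/dev/null
    # SMART overall health of the local disk (device name is an example)
    smartctl -H /dev/sda
    # per-port InfiniBand counters: symbol errors, link downs, and so on
    perfquery

Feed output like that into Nagios- or ganglia-style checks and you have the "risky node" signal that the migration schemes being discussed need.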
> Its something I've wanted to see working - doing a Xen live migration > of a 'dodgy' compute node, and the job just keeps on trucking. > Looks as if these guys have it working. Anyone else seen similar? I don't understand what's wrong with using MPI fault tolerance. I recall testing LAM+BLCR and having processes migrate when SGE host queues were suspended, but I'm not in a position to try the Open-MPI version. Nothing short of checkpoints will help, anyway, when the node just dies, and that's the problem we see most often (e.g. because we were sold a shambolic Barcelona system with flaky hardware and an OS that doesn't support quad core properly). How does Xen perform generally, anyhow? Are there useful data on the HPC performance impact of Xen and/or KVM for, say, Ethernet NUMA systems? I've only seen it for non-NUMA Infiniband systems. From Bogdan.Costescu at iwr.uni-heidelberg.de Tue Jun 16 03:27:18 2009 From: Bogdan.Costescu at iwr.uni-heidelberg.de (Bogdan Costescu) Date: Tue, 16 Jun 2009 12:27:18 +0200 (CEST) Subject: [Beowulf] HPC fault tolerance using virtualization In-Reply-To: <9f8092cc0906160205t18663ec0l59840058566ca565@mail.gmail.com> References: <9f8092cc0906151059x6f38f1f3r28f78fde6b09085@mail.gmail.com> <200906161038.56577.kilian.cavalotti.work@gmail.com> <9f8092cc0906160205t18663ec0l59840058566ca565@mail.gmail.com> Message-ID: On Tue, 16 Jun 2009, John Hearns wrote: > I believe that if we can get features like live migration of failing > machines, plus specialized stripped-down virtual machines specific > to job types then we will see virtualization becoming mainstream in > HPC clustering. You might be right, at least when talking about the short term. It has been my experience with several ISVs that they are very slow in adopting newer features related to system infrastructure in their software - by system infrastructure I mean anything that has to do with the OS (f.e. taking advantage of CPU/mem affinity), MPI lib, queueing system, etc. So even if the MPI lib will gain features to allow fault tolerance, it will take a long time until they will be in real-world use. By comparison, virtualization is something that the ISVs can completely offload to the sysadmins or system integrators, because neither the application nor the MPI lib (which is sometimes linked in the executable...) will have to be aware of it. The ISVs can then even choose what virtualization solution they "support". Another aspect, which I have already mentioned some time ago, is that the ISV can much easier force the usage of a particular OS and environment, because this runs in the VM and is independent of what runs on the host. They can even provide a VM image which includes the OS, environment and application and declare this as the only supported configuration... this is done already for non-parallel applications, but there's only one step needed for parallel ones: adapting it to the underlying network to get the HPC level of performance. I think that adapting to the queueing system is not really necessary from inside the VM; the queueing system can either start one VM per core or start one VM with several virtual CPUs to fill the number of processing elements (or slots) allocated for the job on the node - if the VM is able to adapt itself to such a situation, f.e. by starting several MPI ranks and using shared memory for MPI communication. 
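To make that concrete, the prolog/epilogue pair could be as simple as the following sketch -- libvirt's virsh is assumed purely for illustration (Xen's xm has equivalent create/shutdown/destroy commands), and the file and domain names are made up; the stop sequence it uses is spelled out just below.

    # prolog: start the per-job VM described by a prepared libvirt XML file
    virsh create /var/lib/cluster/vms/job-${JOB_ID}.xml

    # epilogue: ask the guest to stop cleanly, then pull the plug if it ignores us
    virsh shutdown job-${JOB_ID}
    sleep 120
    virsh destroy job-${JOB_ID}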
Further, to cleanly stop the job, the queueing system will have to stop the VMs, sending first a "shutdown" and then a "destroy" command, similar to sending SIGTERM and SIGKILL today. -- Bogdan Costescu IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany Phone: +49 6221 54 8240, Fax: +49 6221 54 8850 E-mail: bogdan.costescu at iwr.uni-heidelberg.de From mdidomenico4 at gmail.com Tue Jun 16 05:42:36 2009 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Tue, 16 Jun 2009 08:42:36 -0400 Subject: [Beowulf] HPC fault tolerance using virtualization In-Reply-To: <9f8092cc0906151258t2d992c69teef9eb2efe3a680f@mail.gmail.com> References: <9f8092cc0906151059x6f38f1f3r28f78fde6b09085@mail.gmail.com> <9f8092cc0906151258t2d992c69teef9eb2efe3a680f@mail.gmail.com> Message-ID: On Mon, Jun 15, 2009 at 3:58 PM, John Hearns wrote: > 2009/6/15 Michael Di Domenico : >> Course having said all that, if you've been watching the linux-kernel >> mailing list you've probably noticed the Xen/Kvm/Linux HV argument >> that took place last week. Makes me a little afraid to push any Linux >> HV solution into to production, but it's a fun experiment none the >> less... > > Could you please give us a summary? Really would be appreciated - I > don't follow those lists that closely. > I spent the last week in a small room with lots of racks, cables and > blinking lights! it was a little hard to follow as it turned into six arguments in one thread like most mailing list arguments tend to do. as best i could tell there were three major things going on
1. whether or not to turn the linux kernel into a hypervisor or keep things as they are with linux feeding scheduling and drivers to the domains
2. whether or not to add a xen branch to the linux kernel and make xen a formal branch. the xen guys claim this would require a long time and too many changes on their part to fit the kernel build/coding process
3. general code quality with the xen patches. linus showed the git diff history between kvm and xen. apparently xen pollutes the kernel code base something fierce and it makes people unhappy. apparently a lot of xen patches get rejected.
i'm not positive i've surmised everything that happened, but thats what i took away from the threads. From ashley at pittman.co.uk Tue Jun 16 07:06:18 2009 From: ashley at pittman.co.uk (Ashley Pittman) Date: Tue, 16 Jun 2009 15:06:18 +0100 Subject: [Beowulf] HPC fault tolerance using virtualization In-Reply-To: References: <9f8092cc0906151059x6f38f1f3r28f78fde6b09085@mail.gmail.com> <200906161038.56577.kilian.cavalotti.work@gmail.com> <9f8092cc0906160205t18663ec0l59840058566ca565@mail.gmail.com> Message-ID: <1245161178.25546.1293.camel@localhost.localdomain> On Tue, 2009-06-16 at 12:27 +0200, Bogdan Costescu wrote: > You might be right, at least when talking about the short term. It has > been my experience with several ISVs that they are very slow in > adopting newer features related to system infrastructure in their > software - by system infrastructure I mean anything that has to do > with the OS (f.e. taking advantage of CPU/mem affinity), MPI lib, > queueing system, etc. So even if the MPI lib will gain features to > allow fault tolerance, it will take a long time until they will be in > real-world use. This is true, ISVs like to statically link everything, lock things down as much as possible and then rubber-stamp it as "supported".
> By comparison, virtualisation is something that the ISVs can > completely offload to the sysadmins or system integrators, because > neither the application nor the MPI lib (which is sometimes linked in > the executable...) will have to be aware of it. The ISVs can then even > choose what virtualization solution they "support". So it's good for ISVs. It's bad for the sysadmins, it's bad for the system integrators and it's bad for the end users. > Another aspect, which I have already mentioned some time ago, is that > the ISV can much easier force the usage of a particular OS and > environment, because this runs in the VM and is independent of what > runs on the host. They can even provide a VM image which includes the > OS, environment and application and declare this as the only supported > configuration... This is frankly an insane way of doing things; the only justification I can find for doing it is that ISV code is flaky and breaks if things like say a network driver change underneath them. The correct answer for this is obviously to write better quality software, a job made easier in the open-source world where it's a lot easier to re-compile code should there be an underlying change in the OS. > this is done already for non-parallel applications, I'll believe it, it's driven from the windows world (as is much of the virtualisation hype) where it really is only possible to run one service per OS instance so any complex set of software requires N underutilised computers to function properly. What virtualisation does is allow you to run these N underutilised computers on one single computer. > but there's only one step needed for parallel ones: adapting it to the > underlying network to get the HPC level of performance. I think that > adapting to the queueing system is not really necessary from inside > the VM; the queueing system can either start one VM per core or start > one VM with several virtual CPUs to fill the number of processing > elements (or slots) allocated for the job on the node - if the VM is > able to adapt itself to such a situation, f.e. by starting several MPI > ranks and using shared memory for MPI communication. Further, to > cleanly stop the job, the queueing system will have to stop the VMs, > sending first a "shutdown" and then a "destroy" command, similar to > sending SIGTERM and SIGKILL today. So HPC as an industry can invest serious amounts of effort and time converting cluster software into a model where the "application" is really a closed-box virtual image that we simply "start" a number of instances of and wait for it to shutdown after itself. This is superficially great for ISVs as it's just like running in the cloud and it "only" costs 5% in performance (which incidentally I don't believe for a minute). It's a long way from "High Performance" however; perhaps we should coin the phrase "Medium Performance Computing" for this model? Maybe, just maybe for most people it's good enough. At moderate scale, up to a couple of hundred cores this is acceptable, cores are cheap enough (by the hour) that simply paying for the 20% more cores to recoup your 5% performance loss is not an issue, scaling a little bit further isn't beyond the capability of many applications and if you've simplified things elsewhere then the savings will pay for those extra cores anyway.
The real losers here however are those people doing things at a scale where you can't just chuck hardware at the problem and where you do actually care about underlying performance, the traditional HPC crowd who, let's be honest, are the ones with the money and the talent anyway. It's as though HPC has gone mainstream, or is infiltrating it, whilst at the same time mainstream computing is jumping into the cloud. All of a sudden HPC doesn't fit into the mainstream model any more (not that it ever really did IMHO) and all the recent converts are sat there in a cloud of hot air looking back at us in bewilderment. Ashley, -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk From hearnsj at googlemail.com Tue Jun 16 07:21:53 2009 From: hearnsj at googlemail.com (John Hearns) Date: Tue, 16 Jun 2009 15:21:53 +0100 Subject: [Beowulf] HPC fault tolerance using virtualization In-Reply-To: <1245161178.25546.1293.camel@localhost.localdomain> References: <9f8092cc0906151059x6f38f1f3r28f78fde6b09085@mail.gmail.com> <200906161038.56577.kilian.cavalotti.work@gmail.com> <9f8092cc0906160205t18663ec0l59840058566ca565@mail.gmail.com> <1245161178.25546.1293.camel@localhost.localdomain> Message-ID: <9f8092cc0906160721n59d209brd63f3e840ef11aea@mail.gmail.com> 2009/6/16 Ashley Pittman > > > > elements (or slots) allocated for the job on the node - if the VM is > > able to adapt itself to such a situation, f.e. by starting several MPI > > ranks and using shared memory for MPI communication. Further, to > > cleanly stop the job, the queueing system will have to stop the VMs, > > sending first a "shutdown" and then a "destroy" command, similar to > > sending SIGTERM and SIGKILL today. > I will provide a counter-example here - I think that a lot of people have thought about re-booting nodes every time they finish a job. There are codes out there which leave processes running, or leave shared memory segments, if the code is not properly terminated. I think everyone has had to run clean-ipcs at some time! Yes, you're right, the codes should be written properly and should not do this. However it is very tempting to put a reboot in as a step following every job, which means you get a machine in a known state for the next job. Running virtual machines will make that easy (depends how long they take to boot up). I agree with you about the 5% figure - the point I was making is that there will come a point where the advantages of running a virtual machine will outweigh a few percent of performance loss. Who knows where that point will be! -------------- next part -------------- An HTML attachment was scrubbed... URL: From egan at sense.net Tue Jun 16 09:08:23 2009 From: egan at sense.net (Egan Ford) Date: Tue, 16 Jun 2009 10:08:23 -0600 Subject: [Beowulf] HPC fault tolerance using virtualization In-Reply-To: <9f8092cc0906151059x6f38f1f3r28f78fde6b09085@mail.gmail.com> References: <9f8092cc0906151059x6f38f1f3r28f78fde6b09085@mail.gmail.com> Message-ID: <258e18ac0906160908j17bc4fd7qa6cd566acd83526c@mail.gmail.com> The good news... We (IBM) demonstrated such a system at SC08 as a Cloud Computing demo. The setup was a combination of Moab, xCAT, and Xen. xCAT is an open source provisioning system that can control/monitor hardware, discover nodes, and provision stateful/stateless physical nodes and virtual machines. xCAT supports both KVM and Xen including live migration. Moab was used as a scheduler with xCAT as one of Moab's resource managers.
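For flavor, the xCAT side of that is driven from a handful of commands -- a sketch with made-up node and image names; exact options depend on the xCAT version:

    # service-processor power control and status for a node range
    rpower node01-node10 stat
    rpower node36 boot
    # point nodes at a stateless image and let them (re)provision on next boot
    nodeset node01-node10 netboot
    # read service-processor vitals (temperatures, fans) for threshold policies
    rvitals node01-node10 temp

Moab drives the same operations through xCAT's XML interface rather than the CLI, as described next.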
Moab uses xCAT's SSL/XML interface to query node state and to tell xCAT what to do. Some of the things you can do:
1. Green computing. Provision nodes on-demand as needed with any OS (Windows too). E.g. Torque command line: qsub -l nodes=10:ppn=8,walltime=10:00:00,os=rhimagea. Idle rhimagea nodes will be reused, other idle or off nodes will be provisioned with rhimagea. When Torque checks in, the job starts. For this to be efficient, all node images including hypervisor images should be stateless. For Windows we use preinstalled iSCSI images (xCAT uses gpxe to simulate iSCSI HW on any x86_64 node). When nodes are idle for more than 10 minutes Moab instructs xCAT to power off the nodes (unless something in the queue will use them soon). Since it's stateless there is no need for cleanup. I have this running on a 3780 diskless node system today.
2. Route around problems. If a dynamic provision fails, it will try another node. Moab can also query xCAT about the HW health of the machine and opt to avoid using nodes that have an "amber" light. Excessive ECCs, over temp, etc... are events that our service processors log. If a threshold is reached the node is marked "risky", or "doomed to fail". Moab policies can be set up to determine how to handle nodes in this state, e.g. Local MPI jobs--no risky nodes. Grid jobs from another University--ok to use risky nodes. Or, setup a reservation and email someone to fix it.
3. Virtual machine balancing. Since xCAT can live migrate Xen, KVM (and soon ESX4) and since it provides a programmable interface, Moab has no problem moving VMs around based on policies. Combine this with the above two examples and you can move VMs if a HW warning is issued. You can enable green to consolidate VMs and power off nodes. You can query xCAT for node temp and do thermal balancing.
The above are just a few ideas that we are pursuing with our customers today.
The bad news... I have no idea the state of VMs on IB. That can be an issue with MPI. Believe it or not, but most HPC sites do not use MPI. They are all batch systems where storage I/O is the bottleneck. However, I have tested MPI over IP with VMs and moved things around. No problem. Hint: You will need a large L2 network since the VMs retain their MAC and IP. Yes there are workarounds, but nothing as easy as a large L2. Application performance may suffer in a VM. Benchmark first. If you just use #1 and #2 above on the iron, you can decrease your risk of failure and run faster. And we all checkpoint, right? :-)
Lastly, check out http://lxc.sourceforge.net/. This is lightweight virtualization. It's not a new concept, but hopefully by next year automated checkpoint/restart with MPI jobs over IB may be supported. This may be a better fit for HPC than full-on virtualization.
On Mon, Jun 15, 2009 at 11:59 AM, John Hearns wrote: > I was doing a search on ganglia + ipmi (I'm looking at doing such a > thing for temperature measurement) when I came across this paper: > > http://www.csm.ornl.gov/~engelman/publications/nagarajan07proactive.ppt.pdf > > Proactive Fault Tolerance for HPC using Xen virtualization > > Its something I've wanted to see working - doing a Xen live migration > of a 'dodgy' compute node, and the job just keeps on trucking. > Looks as if these guys have it working. Anyone else seen similar?
> > John Hearns > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Greg at keller.net Tue Jun 16 09:14:04 2009 From: Greg at keller.net (Greg Keller) Date: Tue, 16 Jun 2009 11:14:04 -0500 Subject: [Beowulf] HPC fault tolerance using virtualization In-Reply-To: <200906161408.n5GE8LIS004808@bluewest.scyld.com> References: <200906161408.n5GE8LIS004808@bluewest.scyld.com> Message-ID: <09A2D5A4-9110-4BDD-B141-675F9EA07F86@Keller.net> > > > Date: Tue, 16 Jun 2009 10:38:55 +0200 > From: Kilian CAVALOTTI > > On Monday 15 June 2009 20:47:40 Michael Di Domenico wrote: >> It would be nice to be able to just move bad hardware out from >> under a >> running job without affecting the run of the job. > > I may be missing something major here, but if there's bad hardware, > chances > are the job has already failed from it, right? Would it be a bad > disk (and the > OS would only notice a bad disk while trying to write on it, likely > asked to > do so by the job), or bad memory, or bad CPU, or faulty PSU. > Anything hardware > losing bits mainly manifests itself in software errors. There is > very little > chance to spot a bad DIMM until something (like a job) tries to > write to it. We have recently purchased "un-blade" systems that may fit into the missing list. These are systems where multiple nodes are hard wired into a single chassis and in order to work on 1, all of them have to come offline. The power efficiency and system costs are compelling, but the complexity of maintenance is a trade off we decided to try. If the Virtualization tax was low enough it would be useful, and make us more incented to use these more cost/power efficient options without creating huge maintenance hassles. > > > So unless there's a way to detect faulty hardware before it affects > anything > software, it's very likely that the job would have crashed already, > before the > OS could pull out its migration toolkit. IF the job is running against a large Networked File System, but the local *Real* OS is depending on the failing disk, the job could be migrated off when the OS starts detecting SCSI or Network (IB?) errors. Same is true for some network issues. Of course, who in their right mind would want an OS dependent on a local disk these days? :) Note: this is a Shameless plug for Perceus and all other such options that leave spinning disk for scratch/checkpointing or some other lower risk purpose... if any. > > > The paper John mentioned is centered around IPMI for preventive fault > detection. It probably works for some cases (where you can use > thresholds, > like temperature probes or fan speeds), where IPMI detects hardware > errors > before it affects the running job. But from what I've seen most > often, it's > kind of too late, and IPMI logs an error when the job has crashed > already. And > even if it didn't crash yet, what kind of assurance do you have that > the > result of simulation has not been corrupted in some way by that > faulty DIMM > you got? Single Bit Errors likely won't corrupt the system, but it would be nice to handle them when they pop up, rather than waiting for maintenance windows or offlining the node and waiting for any jobs to drain off of it.
This would be a win for an admin to do maintenance on their own schedules and minimize the actual lost compute time of the machine. > > My take on this is that it's probably more efficient to develop > checkpointing > features and recovery in software (like MPI) rather than adding a > virtualization layer, which is likely to decrease performance. I agree. I was very excited about "Evergrid"'s (Now Librato?) notion of universal checkpointing... but I've never been able to get any time from/with them. This seems like an approach for checkpointing that would work out very cleanly for many apps that are clueless on the notion of checkpointing. Moral of the story: There was a day when the OS was a huge consumer of a workstation's resources (CPU/Memory/Disk) and as such a huge Tax. Today it's a small fraction of the footprint, and so we worry less and less about it and its efficiency except where it impacts the performance/stability of the apps that depend on it. My guess is that Virtualization is just an extension of that trend and will eventually be the way we need to go as the Tax of the OS / VM layers becomes more minimal. Given that general trend, I am happy to see smart people who prefer graceful code and efficiency trying to steer these VM options toward low overhead solutions where I can firewall a bad code from leaving machines in a bad state for the next code that tries to run there. Should the applications be better stewards of the environment they run in... yes. Should the OS protect itself better from bad codes... yes. Should the admin configure the OS better so that codes can't do bad things... yes. But I can't control those, I can control my OS and give the apps their own OS via VM's if the Tax is low enough. Anything different would be like saying we shouldn't need firewalls because the apps listening to the ports shouldn't be hackable. It's true, but not something I want to try and control. > > Cheers, > -- > Kilian > Cheers! Greg -------------- next part -------------- An HTML attachment was scrubbed... URL: From hearnsj at googlemail.com Tue Jun 16 09:23:31 2009 From: hearnsj at googlemail.com (John Hearns) Date: Tue, 16 Jun 2009 17:23:31 +0100 Subject: [Beowulf] HPC fault tolerance using virtualization In-Reply-To: <258e18ac0906160908j17bc4fd7qa6cd566acd83526c@mail.gmail.com> References: <9f8092cc0906151059x6f38f1f3r28f78fde6b09085@mail.gmail.com> <258e18ac0906160908j17bc4fd7qa6cd566acd83526c@mail.gmail.com> Message-ID: <9f8092cc0906160923t7471948as746bbdb4883faa5@mail.gmail.com> 2009/6/16 Egan Ford > I have no idea the state of VMs on IB. That can be an issue with MPI. > Believe it or not, but most HPC sites do not use MPI. They are all batch > systems where storage I/O is the bottleneck. Burn the Witch! Burn the Witch! Any HPC installation, if you want to show it off to alumni, august committees from grant awarding bodies etc. and not get sand kicked in your face from the big boys in the Top 500 NEEDS an expensive infrastructure of various MPI libraries. Big, big switches with lots of flashing lights. Highly paid, pampered systems admins who must be treated like expensive racehorses, and not exercised too much every day. They need cool beers on tap and luxurious offices to relax in while they prepare to do that vital half hours work per day which keeps your Supercomputer flashing away and making noises. -------------- next part -------------- An HTML attachment was scrubbed...
URL: From egan at sense.net Tue Jun 16 10:05:17 2009 From: egan at sense.net (Egan Ford) Date: Tue, 16 Jun 2009 11:05:17 -0600 Subject: [Beowulf] HPC fault tolerance using virtualization In-Reply-To: <9f8092cc0906160923t7471948as746bbdb4883faa5@mail.gmail.com> References: <9f8092cc0906151059x6f38f1f3r28f78fde6b09085@mail.gmail.com> <258e18ac0906160908j17bc4fd7qa6cd566acd83526c@mail.gmail.com> <9f8092cc0906160923t7471948as746bbdb4883faa5@mail.gmail.com> Message-ID: <258e18ac0906161005r3f3c26c3u6b93efda2be62eed@mail.gmail.com> Ha! :-) I've put a few GigE systems in the Top100, and if the stars align you'll see a Top20 GigE system in next week's list. That's ONE GigE to each node oversubscribed 4:1. Sadly no flashing lights, and since it's 100% water cooled with low velocity fans, there is almost no noise. On Tue, Jun 16, 2009 at 10:23 AM, John Hearns wrote: > > > 2009/6/16 Egan Ford > >> I have no idea the state of VMs on IB. That can be an issue with MPI. >> Believe it or not, but most HPC sites do not use MPI. They are all batch >> systems where storage I/O is the bottleneck. > > > Burn the Witch! Burn the Witch! > > Any HPC installation, if you want to show it off to alumni, august > committees from grant awarding bodies etc. and not get sand kicked in your > face from the big boys in the Top 500 NEEDS an expensive infrastructure of > various MPI libraries. Big, big switches with lots of flashing lights. > Highly paid, pampered systems admins who must be treated like expensive > racehorses, and not exercised too much every day. They need cool beers on > tap and luxurious offices to relax in while they prepare to do that vital > half hours work per day which keeps your Supercomputer flashing away and > making noises. > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From landman at scalableinformatics.com Tue Jun 16 10:13:29 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 16 Jun 2009 13:13:29 -0400 Subject: [Beowulf] HPC fault tolerance using virtualization In-Reply-To: <9f8092cc0906160923t7471948as746bbdb4883faa5@mail.gmail.com> References: <9f8092cc0906151059x6f38f1f3r28f78fde6b09085@mail.gmail.com> <258e18ac0906160908j17bc4fd7qa6cd566acd83526c@mail.gmail.com> <9f8092cc0906160923t7471948as746bbdb4883faa5@mail.gmail.com> Message-ID: <4A37D2B9.2050203@scalableinformatics.com> John Hearns wrote: > Any HPC installation, if you want to show it off to alumni, august > committees from grant awarding bodies etc. and not get sand kicked in > your face from the big boys in the Top 500 NEEDS an expensive > infrastructure of various MPI libraries. Big, big switches with lots of > flashing lights. Highly paid, pampered systems admins who must be > treated like expensive racehorses, and not exercised too much every day. > They need cool beers on tap and luxurious offices to relax in while they > prepare to do that vital half hours work per day which keeps your > Supercomputer flashing away and making noises. And let us not forget ... the machine that goes "Bing!" (http://www.youtube.com/watch?v=arCITMfxvEc) My apologies to the squeamish amongst you ...
-- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From jmdavis1 at vcu.edu Tue Jun 16 12:06:39 2009 From: jmdavis1 at vcu.edu (Mike Davis) Date: Tue, 16 Jun 2009 15:06:39 -0400 Subject: [Beowulf] HPC fault tolerance using virtualization In-Reply-To: <9f8092cc0906160923t7471948as746bbdb4883faa5@mail.gmail.com> References: <9f8092cc0906151059x6f38f1f3r28f78fde6b09085@mail.gmail.com> <258e18ac0906160908j17bc4fd7qa6cd566acd83526c@mail.gmail.com> <9f8092cc0906160923t7471948as746bbdb4883faa5@mail.gmail.com> Message-ID: <4A37ED3F.1080008@vcu.edu> John Hearns wrote: > > > 2009/6/16 Egan Ford > > > I have no idea the state of VMs on IB. That can be an issue with > MPI. Believe it or not, but most HPC sites do not use MPI. They > are all batch systems where storage I/O is the bottleneck. > > > Burn the Witch! Burn the Witch! > > Any HPC installation, if you want to show it off to alumni, august > committees from grant awarding bodies etc. and not get sand kicked in > your face from the big boys in the Top 500 NEEDS an expensive > infrastructure of various MPI libraries. Big, big switches with lots > of flashing lights. Highly paid, pampered systems admins who must be > treated like expensive racehorses, and not exercised too much every > day. They need cool beers on tap and luxurious offices to relax in > while they prepare to do that vital half hours work per day which > keeps your Supercomputer flashing away and making noises. [rant] I realize that this is humor, but one must remember just how sensitive System Admins can be before making such statements. I would like to refer you to the BOFH (Bastard Operator from Hell) or, as I like to call it, the SysAdmin's guide to interpersonal relationships. Remember what these people do and more importantly what they can do. On a serious note, who else gets out of bed at 3 am because an automated system indicates an issue with an HPC research cluster, or the Computing Center calls because fresh water has been cut off and the building is warming, or you get the call that the water pumps (dual for redundancy but sharing one controller, now that's engineering) have failed, or that machine room power is dirty because 1/2 of the battery bank has shorted and the other half can't supply all of the needed clean power etc, etc. In my experience, Sysadmins don't want beer or luxurious offices; they want the tools that they need, proper managerial support, and respect. [/rant] -- Mike Davis Technical Director (804) 828-3885 Center for High Performance Computing jmdavis1 at vcu.edu Virginia Commonwealth University "Never tell people how to do things. Tell them what to do and they will surprise you with their ingenuity." George S.
Patton From chekh at pcbi.upenn.edu Mon Jun 15 14:47:44 2009 From: chekh at pcbi.upenn.edu (Alex Chekholko) Date: Mon, 15 Jun 2009 17:47:44 -0400 Subject: [Beowulf] HPC fault tolerance using virtualization In-Reply-To: <9f8092cc0906151258t2d992c69teef9eb2efe3a680f@mail.gmail.com> References: <9f8092cc0906151059x6f38f1f3r28f78fde6b09085@mail.gmail.com> <9f8092cc0906151258t2d992c69teef9eb2efe3a680f@mail.gmail.com> Message-ID: <20090615174744.01ffe1e8.chekh@pcbi.upenn.edu> On Mon, 15 Jun 2009 20:58:58 +0100 John Hearns wrote: > 2009/6/15 Michael Di Domenico : > > > > > > Course having said all that, if you've been watching the linux-kernel > > mailing list you've probably noticed the Xen/Kvm/Linux HV argument > > that took place last week. Makes me a little afraid to push any Linux > > HV solution into to production, but it's a fun experiment none the > > less... > > Could you please give us a summary? Really would be appreciated - I > don't follow those lists that closely. > I spent the last week in a small room with lots of racks, cables and > blinking lights! > I think this story from June 3 is related: http://lwn.net/Articles/335812/ The domU part of Xen is in the Linux kernel, but the dom0 and hypervisor parts of Xen are not part of the mainline kernel, so there are some extra hassles in running them. -- Alex Chekholko From andrew.robbie at gmail.com Tue Jun 16 06:04:40 2009 From: andrew.robbie at gmail.com (Andrew Robbie (Gmail)) Date: Tue, 16 Jun 2009 23:04:40 +1000 Subject: [Beowulf] HPC fault tolerance using virtualization In-Reply-To: <9f8092cc0906151258t2d992c69teef9eb2efe3a680f@mail.gmail.com> References: <9f8092cc0906151059x6f38f1f3r28f78fde6b09085@mail.gmail.com> <9f8092cc0906151258t2d992c69teef9eb2efe3a680f@mail.gmail.com> Message-ID: <6259EDD3-2E8A-45B4-AAE6-A8262A0D84BB@gmail.com> On 16/06/2009, at 5:58 AM, John Hearns wrote: > 2009/6/15 Michael Di Domenico : >> >> >> Course having said all that, if you've been watching the linux-kernel >> mailing list you've probably noticed the Xen/Kvm/Linux HV argument >> that took place last week. Makes me a little afraid to push any >> Linux >> HV solution into to production, but it's a fun experiment none the >> less... > > Could you please give us a summary? Really would be appreciated - I > don't follow those lists that closely. There is an article at LWN covering this topic: http://lwn.net/Articles/335812 From kilian.cavalotti.work at gmail.com Wed Jun 17 05:10:04 2009 From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI) Date: Wed, 17 Jun 2009 14:10:04 +0200 Subject: [Beowulf] HPC fault tolerance using virtualization In-Reply-To: <9f8092cc0906160923t7471948as746bbdb4883faa5@mail.gmail.com> References: <9f8092cc0906151059x6f38f1f3r28f78fde6b09085@mail.gmail.com> <258e18ac0906160908j17bc4fd7qa6cd566acd83526c@mail.gmail.com> <9f8092cc0906160923t7471948as746bbdb4883faa5@mail.gmail.com> Message-ID: <200906171410.04848.kilian.cavalotti.work@gmail.com> On Tuesday 16 June 2009 18:23:31 John Hearns wrote: > Highly paid, pampered systems admins who must be treated like expensive > racehorses, and not exercised too much every day. They need cool beers on > tap and luxurious offices to relax in Hey! I've been ripped off...
-- Kilian From prentice at ias.edu Wed Jun 17 06:27:14 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Wed, 17 Jun 2009 09:27:14 -0400 Subject: [Beowulf] HPC fault tolerance using virtualization In-Reply-To: <200906171410.04848.kilian.cavalotti.work@gmail.com> References: <9f8092cc0906151059x6f38f1f3r28f78fde6b09085@mail.gmail.com> <258e18ac0906160908j17bc4fd7qa6cd566acd83526c@mail.gmail.com> <9f8092cc0906160923t7471948as746bbdb4883faa5@mail.gmail.com> <200906171410.04848.kilian.cavalotti.work@gmail.com> Message-ID: <4A38EF32.8000602@ias.edu> Kilian CAVALOTTI wrote: > On Tuesday 16 June 2009 18:23:31 John Hearns wrote: >> Highly paid, pampered systems admins who must be treated like expensive >> racehorses, and not exercised too much every day. They need cool beers on >> tap and luxurious offices to relax in > > Hey! I've been ripped off... > I'm using John's e-mail as the basis for salary renegotiations right now. -- Prentice From hahn at mcmaster.ca Wed Jun 17 11:52:23 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed, 17 Jun 2009 14:52:23 -0400 (EDT) Subject: [Beowulf] HPC fault tolerance using virtualization In-Reply-To: <258e18ac0906160908j17bc4fd7qa6cd566acd83526c@mail.gmail.com> References: <9f8092cc0906151059x6f38f1f3r28f78fde6b09085@mail.gmail.com> <258e18ac0906160908j17bc4fd7qa6cd566acd83526c@mail.gmail.com> Message-ID: > Believe it or not, but most HPC sites do not use MPI. They are all batch > systems where storage I/O is the bottleneck. However, I have tested MPI out of curiosity, why do you call this sort of thing HPC? I think "batch" or perhaps "throughput" covers it, but why _high_performance_, if in fact it is composed of all routine-performance pieces? thanks, mark hahn. From lindahl at pbm.com Wed Jun 17 12:10:07 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Wed, 17 Jun 2009 12:10:07 -0700 Subject: [Beowulf] HPC fault tolerance using virtualization In-Reply-To: References: <9f8092cc0906151059x6f38f1f3r28f78fde6b09085@mail.gmail.com> <258e18ac0906160908j17bc4fd7qa6cd566acd83526c@mail.gmail.com> Message-ID: <20090617191007.GD29548@bx9.net> > out of curiosity, why do you call this sort of thing HPC? > I think "batch" or perhaps "throughput" covers it, but why _high_performance_, > if in fact it is composed of all routine-performance pieces? It's a calling a duck a duck issue. "Warehouse computing" looks a lot like ethernet-connected HPC clusters, only larger. I can't afford to waste a couple of days running Linpack, so my clusters don't appear on the Top500. I do a mix of embarrassingly parallel jobs and tightly-coupled parallel computations. It looks a lot like how the oil & gas industry uses clusters. -- greg (wearing my Blekko hat) From jac67 at georgetown.edu Wed Jun 17 12:32:08 2009 From: jac67 at georgetown.edu (Jess Cannata) Date: Wed, 17 Jun 2009 15:32:08 -0400 Subject: [Beowulf] Off Topic: Introduction to HPC/Beowulf Cluster Class Message-ID: <4A3944B8.8080601@georgetown.edu> Georgetown University High Performance Cluster Computing Training We have two upcoming Introduction to Beowulf Clusters Administration classes (July 28-31, 2009 and October 20-23, 2009). The course covers Oscar, diskless provisioned clusters, and SGE. More information can be found at http://training.arc.georgetown.edu/ or by contacting me. Thanks.
-- Jess Cannata Advanced Research Computing & High Performance Computing Training Georgetown University 202-687-3661 From diep at xs4all.nl Wed Jun 17 16:09:45 2009 From: diep at xs4all.nl (Vincent Diepeveen) Date: Thu, 18 Jun 2009 01:09:45 +0200 Subject: [Beowulf] Data Center Overload In-Reply-To: <20090615150743.GZ23524@leitl.org> References: <20090615150743.GZ23524@leitl.org> Message-ID: On Jun 15, 2009, at 5:07 PM, Eugen Leitl wrote: > > http://www.nytimes.com/2009/06/14/magazine/14search-t.html? > _r=1&ref=magazine&pagewanted=print > > Data Center Overload > > By TOM VANDERBILT Is this sneaked-in adware? Dropping the name 'tukwila' in a different context - a data center wintel context now, which by accident is also the already in advance failed chip from intel (an itanium quad core of 2.0 - 2.5Ghz by 2010 against some sort of real huge price is not exactly what we all wait for i assume) ? I typed it into google in different manners using english-us settings: tukwila microsoft datacenter. the ONLY hit it had that mentioned a microsoft datacenter called tukwila was from that article from Tom Vanderbilt from 14 june 2009. This link gets 2 hits. In short google had never before heard of a datacenter at microsoft called tukwila. This article was the first to mention it. In total when i keep scrolling further (tukwila microsoft datacenter gets mentioned in 16200 hits here) the Vanderbilt Article gets indexed more. Out of that 16200 his article gets several hits. Vincent - search expert From prentice at ias.edu Thu Jun 18 08:20:47 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Thu, 18 Jun 2009 11:20:47 -0400 Subject: [Beowulf] IB problem/using IB diagnostics Message-ID: <4A3A5B4F.60006@ias.edu> One of my nodes has an IB problem: -------------------------------------------------------------------------- WARNING: There is at least on IB HCA found on host 'node36.aurora', but there is no active ports detected. This is most certainly not what you wanted. Check your cables and SM configuration. -------------------------------------------------------------------------- I've been trying to use some of the IB diagnostics tools to diagnose the problem, but I can't seem to figure out the correct syntax of the commands. On a node that I know is working correctly (the master node): # ibaddr GID 0xfe800000000000000005ad00001ffb29 LID start 0x1 end 0x1 # ibping 1 ibwarn: [14756] mad_rpc_rmpp: _do_madrpc failed; dport (Lid 1) ibwarn: [14756] mad_rpc_rmpp: _do_madrpc failed; dport (Lid 1) # ibping 0x1 ibwarn: [14758] mad_rpc_rmpp: _do_madrpc failed; dport (Lid 1) ibwarn: [14758] mad_rpc_rmpp: _do_madrpc failed; dport (Lid 1) ibping -G 0xfe800000000000000005ad00001ffb29 ibwarn: [14763] ib_path_query_via: sa call path_query failed ibping: iberror: failed: can't resolve destination port 0xfe800000000000000005ad00001ffb29 What am I doing wrong? How can I verify my IB configuration/SM configuration before I start pulling IB cables?
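For what it's worth, a few checks that usually narrow this kind of thing down (a sketch using the standard infiniband-diags tools; the LID below is a placeholder):

    # on the problem node: is the port Down/Polling (cable or switch side)
    # or stuck in Initializing (the subnet manager never configured it)?
    ibstat
    # ibping is not ICMP ping -- it needs a responder on the far end first
    ibping -S              # run this on the node you want to ping
    ibping <lid>           # then ping that node's LID from somewhere else
    # sweep the fabric for ports with error counters
    ibcheckerrors

The "mad_rpc_rmpp: _do_madrpc failed" noise is typically just what ibping prints when nothing is answering at that LID.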
-- Prentice From landman at scalableinformatics.com Thu Jun 18 09:14:30 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 18 Jun 2009 12:14:30 -0400 Subject: [Beowulf] IB problem/using IB diagnostics In-Reply-To: <4A3A5B4F.60006@ias.edu> References: <4A3A5B4F.60006@ias.edu> Message-ID: <4A3A67E6.9080007@scalableinformatics.com> Prentice Bisbal wrote: > One of my nodes has an IB problem: > > -------------------------------------------------------------------------- > WARNING: There is at least on IB HCA found on host 'node36.aurora', but > there is > no active ports detected. This is most certainly not what you wanted. > Check your cables and SM configuration. > -------------------------------------------------------------------------- > > > I've been trying to use some of the IB diagnostics tools to diagnose the > problem, but I can't seem to figure out the correct syntax of the > commands. On a node that I know is working correctly (the master node): > Sanity check: ibnodes or ibhosts report what? This will test the connection, session manager, and a few other things. If you get nothing back then things are problematic ibdiagnet is also useful for status of the subnets. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From prentice at ias.edu Thu Jun 18 10:48:16 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Thu, 18 Jun 2009 13:48:16 -0400 Subject: [Beowulf] IB problem/using IB diagnostics In-Reply-To: <4A3A67E6.9080007@scalableinformatics.com> References: <4A3A5B4F.60006@ias.edu> <4A3A67E6.9080007@scalableinformatics.com> Message-ID: <4A3A7DE0.8040401@ias.edu> Joe Landman wrote: > Prentice Bisbal wrote: >> One of my nodes has an IB problem: >> >> -------------------------------------------------------------------------- >> >> WARNING: There is at least on IB HCA found on host 'node36.aurora', but >> there is >> no active ports detected. This is most certainly not what you wanted. >> Check your cables and SM configuration. >> -------------------------------------------------------------------------- >> >> >> >> I've been trying to use some of the IB diagnostics tools to diagnose the >> problem, but I can't seem to figure out the correct syntax of the >> commands. On a node that I know is working correctly (the master node): >> > > Sanity check: > > ibnodes or ibhosts report what? > > This will test the connection, session manager, and a few other > things. If you get nothing back then things are problematic > > ibdiagnet is also useful for status of the subnets. > The output of these 3 commands is below. I don't see anything really unusual. Sorry for the long e-mail, and thanks for the help. 
[root at aurora ~]# ibnodes Ca : 0x0005ad00001ffb70 ports 2 "node64 HCA-1" Ca : 0x0005ad00001ffaec ports 2 "node63 HCA-1" Ca : 0x0005ad00001ffb1c ports 2 "node62 HCA-1" Ca : 0x0005ad00001ffab8 ports 2 "node61 HCA-1" Ca : 0x0005ad00001ffa90 ports 2 "node60 HCA-1" Ca : 0x0005ad00001ffae0 ports 2 "node59 HCA-1" Ca : 0x0005ad00001ffb30 ports 2 "node58 HCA-1" Ca : 0x0005ad00001ffba4 ports 2 "node57 HCA-1" Ca : 0x0005ad00001ffb2c ports 2 "node56 HCA-1" Ca : 0x0005ad00001ffb74 ports 2 "node55 HCA-1" Ca : 0x0005ad00001ff990 ports 2 "node54 HCA-1" Ca : 0x0005ad00001ffb58 ports 2 "node53 HCA-1" Ca : 0x0005ad00001ff964 ports 2 "node52 HCA-1" Ca : 0x0005ad00001ff99c ports 2 "node51 HCA-1" Ca : 0x0005ad00001ffb54 ports 2 "node50 HCA-1" Ca : 0x0005ad00001ffb48 ports 2 "node49 HCA-1" Ca : 0x0005ad00001ff9a0 ports 2 "node48 HCA-1" Ca : 0x0005ad00001ffb80 ports 2 "node47 HCA-1" Ca : 0x0005ad00001ff9a4 ports 2 "node46 HCA-1" Ca : 0x0005ad00001ffb44 ports 2 "node45 HCA-1" Ca : 0x0005ad00001ffb94 ports 2 "node44 HCA-1" Ca : 0x0005ad00001ffaf8 ports 2 "node43 HCA-1" Ca : 0x0005ad00001ffb40 ports 2 "node42 HCA-1" Ca : 0x0005ad00001ffb24 ports 2 "node41 HCA-1" Ca : 0x0005ad00001ffbc4 ports 2 "node40 HCA-1" Ca : 0x0005ad00001ff968 ports 2 "node39 HCA-1" Ca : 0x0005ad00001ffb6c ports 2 "node38 HCA-1" Ca : 0x0005ad00001ffaa4 ports 2 "node37 HCA-1" Ca : 0x0005ad00001ff970 ports 2 "node36 HCA-1" Ca : 0x0005ad00001ffaa8 ports 2 "node35 HCA-1" Ca : 0x0005ad00001ffaac ports 2 "node34 HCA-1" Ca : 0x0005ad00001ffb20 ports 2 "node33 HCA-1" Ca : 0x0005ad00001ffa34 ports 2 "node12 HCA-1" Ca : 0x0005ad00001ffa68 ports 2 "node11 HCA-1" Ca : 0x0005ad00001ff978 ports 2 "node10 HCA-1" Ca : 0x0005ad00001ff974 ports 2 "node09 HCA-1" Ca : 0x0005ad00001ffb34 ports 2 "node08 HCA-1" Ca : 0x0005ad00001ffa48 ports 2 "node07 HCA-1" Ca : 0x0005ad00001ff98c ports 2 "node06 HCA-1" Ca : 0x0005ad00001ff994 ports 2 "node05 HCA-1" Ca : 0x0005ad00001ff988 ports 2 "node04 HCA-1" Ca : 0x0005ad00001ffa3c ports 2 "node03 HCA-1" Ca : 0x0005ad00001ff9ac ports 2 "node02 HCA-1" Ca : 0x0005ad00001ffaa0 ports 2 "node01 HCA-1" Ca : 0x0005ad00001ff96c ports 2 "node24 HCA-1" Ca : 0x0005ad00001ffb60 ports 2 "node23 HCA-1" Ca : 0x0005ad00001ffa38 ports 2 "node22 HCA-1" Ca : 0x0005ad00001ff9a8 ports 2 "node21 HCA-1" Ca : 0x0005ad00001ffb4c ports 2 "node20 HCA-1" Ca : 0x0005ad00001ffa40 ports 2 "node19 HCA-1" Ca : 0x0005ad00001ff998 ports 2 "node18 HCA-1" Ca : 0x0005ad00001ffbb8 ports 2 "node17 HCA-1" Ca : 0x0005ad00001ffa9c ports 2 "node16 HCA-1" Ca : 0x0005ad00001ffae8 ports 2 "node15 HCA-1" Ca : 0x0005ad00001ffb3c ports 2 "node14 HCA-1" Ca : 0x0005ad00001ff984 ports 2 "node13 HCA-1" Ca : 0x0005ad00001ffbc0 ports 2 "node32 HCA-1" Ca : 0x0005ad00001ff97c ports 2 "node31 HCA-1" Ca : 0x0005ad00001ff980 ports 2 "node30 HCA-1" Ca : 0x0005ad00001ffab0 ports 2 "node29 HCA-1" Ca : 0x0005ad00001ffafc ports 2 "node28 HCA-1" Ca : 0x0005ad00001ffb64 ports 2 "node27 HCA-1" Ca : 0x0005ad00001ff9b8 ports 2 "node26 HCA-1" Ca : 0x0005ad00001ffb50 ports 2 "node25 HCA-1" Ca : 0x0005ad00001ffb28 ports 2 "aurora HCA-1" Switch : 0x0005ad00070435e2 ports 24 "SFS-7012 GUID=0x0005ad00020434bd Leaf 6, Chip A" base port 0 lid 3 lmc 0 Switch : 0x0005ad10060434a1 ports 24 "SFS-7012 GUID=0x0005ad00020434bd Spine 1, Chip B" base port 0 lid 9 lmc 0 Switch : 0x0005ad100604375e ports 24 "SFS-7012 GUID=0x0005ad00020434bd Spine 3, Chip B" base port 0 lid 8 lmc 0 Switch : 0x0005ad100604351a ports 24 "SFS-7012 GUID=0x0005ad00020434bd Spine 2, Chip B" base port 0 lid 7 lmc 0 Switch : 
0x0005ad00060434a1 ports 24 "SFS-7012 GUID=0x0005ad00020434bd Spine 1, Chip A" enhanced port 0 lid 6 lmc 0 Switch : 0x0005ad000604375e ports 24 "SFS-7012 GUID=0x0005ad00020434bd Spine 3, Chip A" base port 0 lid 5 lmc 0 Switch : 0x0005ad000604351a ports 24 "SFS-7012 GUID=0x0005ad00020434bd Spine 2, Chip A" enhanced port 0 lid 4 lmc 0 Switch : 0x0005ad000704348e ports 24 "SFS-7012 GUID=0x0005ad00020434bd Leaf 4, Chip A" base port 0 lid 14 lmc 0 Switch : 0x0005ad000704348c ports 24 "SFS-7012 GUID=0x0005ad00020434bd Leaf 2, Chip A" base port 0 lid 13 lmc 0 Switch : 0x0005ad000704348a ports 24 "SFS-7012 GUID=0x0005ad00020434bd Leaf 1, Chip A" base port 0 lid 11 lmc 0 Switch : 0x0005ad0007043489 ports 24 "SFS-7012 GUID=0x0005ad00020434bd Leaf 3, Chip A" base port 0 lid 10 lmc 0 Switch : 0x0005ad00070435d4 ports 24 "SFS-7012 GUID=0x0005ad00020434bd Leaf 5, Chip A" base port 0 lid 12 lmc 0 [root at aurora ~]# ibhosts Ca : 0x0005ad00001ffb70 ports 2 "node64 HCA-1" Ca : 0x0005ad00001ffaec ports 2 "node63 HCA-1" Ca : 0x0005ad00001ffb1c ports 2 "node62 HCA-1" Ca : 0x0005ad00001ffab8 ports 2 "node61 HCA-1" Ca : 0x0005ad00001ffa90 ports 2 "node60 HCA-1" Ca : 0x0005ad00001ffae0 ports 2 "node59 HCA-1" Ca : 0x0005ad00001ffb30 ports 2 "node58 HCA-1" Ca : 0x0005ad00001ffba4 ports 2 "node57 HCA-1" Ca : 0x0005ad00001ffb2c ports 2 "node56 HCA-1" Ca : 0x0005ad00001ffb74 ports 2 "node55 HCA-1" Ca : 0x0005ad00001ff990 ports 2 "node54 HCA-1" Ca : 0x0005ad00001ffb58 ports 2 "node53 HCA-1" Ca : 0x0005ad00001ff964 ports 2 "node52 HCA-1" Ca : 0x0005ad00001ff99c ports 2 "node51 HCA-1" Ca : 0x0005ad00001ffb54 ports 2 "node50 HCA-1" Ca : 0x0005ad00001ffb48 ports 2 "node49 HCA-1" Ca : 0x0005ad00001ff9a0 ports 2 "node48 HCA-1" Ca : 0x0005ad00001ffb80 ports 2 "node47 HCA-1" Ca : 0x0005ad00001ff9a4 ports 2 "node46 HCA-1" Ca : 0x0005ad00001ffb44 ports 2 "node45 HCA-1" Ca : 0x0005ad00001ffb94 ports 2 "node44 HCA-1" Ca : 0x0005ad00001ffaf8 ports 2 "node43 HCA-1" Ca : 0x0005ad00001ffb40 ports 2 "node42 HCA-1" Ca : 0x0005ad00001ffb24 ports 2 "node41 HCA-1" Ca : 0x0005ad00001ffbc4 ports 2 "node40 HCA-1" Ca : 0x0005ad00001ff968 ports 2 "node39 HCA-1" Ca : 0x0005ad00001ffb6c ports 2 "node38 HCA-1" Ca : 0x0005ad00001ffaa4 ports 2 "node37 HCA-1" Ca : 0x0005ad00001ff970 ports 2 "node36 HCA-1" Ca : 0x0005ad00001ffaa8 ports 2 "node35 HCA-1" Ca : 0x0005ad00001ffaac ports 2 "node34 HCA-1" Ca : 0x0005ad00001ffb20 ports 2 "node33 HCA-1" Ca : 0x0005ad00001ffa34 ports 2 "node12 HCA-1" Ca : 0x0005ad00001ffa68 ports 2 "node11 HCA-1" Ca : 0x0005ad00001ff978 ports 2 "node10 HCA-1" Ca : 0x0005ad00001ff974 ports 2 "node09 HCA-1" Ca : 0x0005ad00001ffb34 ports 2 "node08 HCA-1" Ca : 0x0005ad00001ffa48 ports 2 "node07 HCA-1" Ca : 0x0005ad00001ff98c ports 2 "node06 HCA-1" Ca : 0x0005ad00001ff994 ports 2 "node05 HCA-1" Ca : 0x0005ad00001ff988 ports 2 "node04 HCA-1" Ca : 0x0005ad00001ffa3c ports 2 "node03 HCA-1" Ca : 0x0005ad00001ff9ac ports 2 "node02 HCA-1" Ca : 0x0005ad00001ffaa0 ports 2 "node01 HCA-1" Ca : 0x0005ad00001ff96c ports 2 "node24 HCA-1" Ca : 0x0005ad00001ffb60 ports 2 "node23 HCA-1" Ca : 0x0005ad00001ffa38 ports 2 "node22 HCA-1" Ca : 0x0005ad00001ff9a8 ports 2 "node21 HCA-1" Ca : 0x0005ad00001ffb4c ports 2 "node20 HCA-1" Ca : 0x0005ad00001ffa40 ports 2 "node19 HCA-1" Ca : 0x0005ad00001ff998 ports 2 "node18 HCA-1" Ca : 0x0005ad00001ffbb8 ports 2 "node17 HCA-1" Ca : 0x0005ad00001ffa9c ports 2 "node16 HCA-1" Ca : 0x0005ad00001ffae8 ports 2 "node15 HCA-1" Ca : 0x0005ad00001ffb3c ports 2 "node14 HCA-1" Ca : 0x0005ad00001ff984 ports 2 "node13 HCA-1" 
Ca : 0x0005ad00001ffbc0 ports 2 "node32 HCA-1" Ca : 0x0005ad00001ff97c ports 2 "node31 HCA-1" Ca : 0x0005ad00001ff980 ports 2 "node30 HCA-1" Ca : 0x0005ad00001ffab0 ports 2 "node29 HCA-1" Ca : 0x0005ad00001ffafc ports 2 "node28 HCA-1" Ca : 0x0005ad00001ffb64 ports 2 "node27 HCA-1" Ca : 0x0005ad00001ff9b8 ports 2 "node26 HCA-1" Ca : 0x0005ad00001ffb50 ports 2 "node25 HCA-1" Ca : 0x0005ad00001ffb28 ports 2 "aurora HCA-1" [root at aurora ~]# ibdiagnet Loading IBDIAGNET from: /usr/lib64/ibdiagnet1.2 -W- Topology file is not specified. Reports regarding cluster links will use direct routes. Loading IBDM from: /usr/lib64/ibdm1.2 -I- Using port 1 as the local port. -I- Discovering ... 77 nodes (12 Switches & 65 CA-s) discovered. -I--------------------------------------------------- -I- Bad Guids/LIDs Info -I--------------------------------------------------- -I- No bad Guids were found -I--------------------------------------------------- -I- Links With Logical State = INIT -I--------------------------------------------------- -I- No bad Links (with logical state = INIT) were found -I--------------------------------------------------- -I- PM Counters Info -I--------------------------------------------------- -W- lid=0x0006 guid=0x0005ad00060434a1 dev=47396 Port=11 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0006 guid=0x0005ad00060434a1 dev=47396 Port=12 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0039 guid=0x0005ad00001ffb6d dev=25208 node38/P1 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0006 guid=0x0005ad00060434a1 dev=47396 Port=19 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0006 guid=0x0005ad00060434a1 dev=47396 Port=20 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0006 guid=0x0005ad00060434a1 dev=47396 Port=21 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0006 guid=0x0005ad00060434a1 dev=47396 Port=23 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0015 guid=0x0005ad00001ffb3d dev=25208 node14/P1 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x004c guid=0x0005ad00001ffab1 dev=25208 node29/P1 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0005 guid=0x0005ad000604375e dev=47396 Port=15 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0005 guid=0x0005ad000604375e dev=47396 Port=16 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0005 guid=0x0005ad000604375e dev=47396 Port=19 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0005 guid=0x0005ad000604375e dev=47396 Port=23 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x001e guid=0x0005ad00001ffb51 dev=25208 node25/P1 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0002 guid=0x0005ad00001ffaa1 dev=25208 node01/P1 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x001b guid=0x0005ad00001ffaad dev=25208 node34/P1 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0020 guid=0x0005ad00001ffa69 dev=25208 node11/P1 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0019 guid=0x0005ad00001ff989 
dev=25208 node04/P1 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0004 guid=0x0005ad000604351a dev=47396 Port=19 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0004 guid=0x0005ad000604351a dev=47396 Port=20 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0004 guid=0x0005ad000604351a dev=47396 Port=21 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0004 guid=0x0005ad000604351a dev=47396 Port=23 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0016 guid=0x0005ad00001ffb35 dev=25208 node08/P1 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x002c guid=0x0005ad00001ff995 dev=25208 node05/P1 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0010 guid=0x0005ad00001ff98d dev=25208 node06/P1 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x004a guid=0x0005ad00001ff9a5 dev=25208 node46/P1 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0027 guid=0x0005ad00001ff9b9 dev=25208 node26/P1 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0006 guid=0x0005ad00060434a1 dev=47396 Port=7 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0006 guid=0x0005ad00060434a1 dev=47396 Port=8 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0006 guid=0x0005ad00060434a1 dev=47396 Port=9 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x001f guid=0x0005ad00001ffa35 dev=25208 node12/P1 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0009 guid=0x0005ad10060434a1 dev=47396 Port=19 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0009 guid=0x0005ad10060434a1 dev=47396 Port=20 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0045 guid=0x0005ad00001ffaf9 dev=25208 node43/P1 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x003d guid=0x0005ad00001ffafd dev=25208 node28/P1 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0008 guid=0x0005ad100604375e dev=47396 Port=13 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0008 guid=0x0005ad100604375e dev=47396 Port=20 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0014 guid=0x0005ad00001ff979 dev=25208 node10/P1 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0035 guid=0x0005ad00001ffb31 dev=25208 node58/P1 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x002a guid=0x0005ad00001ffa9d dev=25208 node16/P1 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0028 guid=0x0005ad00001ffa39 dev=25208 node22/P1 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0007 guid=0x0005ad100604351a dev=47396 Port=19 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0007 guid=0x0005ad100604351a dev=47396 Port=20 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0007 guid=0x0005ad100604351a dev=47396 Port=21 Performance Monitor counter : 
Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0007 guid=0x0005ad100604351a dev=47396 Port=23 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -W- lid=0x0001 guid=0x0005ad00001ffb29 dev=25208 aurora/P1 Performance Monitor counter : Value vl15_dropped : 0xffff (overflow) symbol_error_counter : 0xffff (overflow) -W- lid=0x0043 guid=0x0005ad00001ffb81 dev=25208 node47/P1 Performance Monitor counter : Value symbol_error_counter : 0xffff (overflow) -I--------------------------------------------------- -I- Fabric Partitions Report (see ibdiagnet.pkey for a full hosts list) -I--------------------------------------------------- -I- PKey:0x7fff Hosts:65 full:65 partial:0 -I--------------------------------------------------- -I- IPoIB Subnets Check -I--------------------------------------------------- -I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:10Gbps SL:0x00 -W- Suboptimal rate for group. Lowest member rate:20Gbps > group-rate:10Gbps -I--------------------------------------------------- -I- Bad Links Info -I- No bad link were found -I--------------------------------------------------- ---------------------------------------------------------------- -I- Stages Status Report: STAGE Errors Warnings Bad GUIDs/LIDs Check 0 0 Link State Active Check 0 0 Performance Counters Report 0 47 Partitions Check 0 0 IPoIB Subnets Check 0 1 Please see /tmp/ibdiagnet.log for complete log ---------------------------------------------------------------- -I- Done. Run time was 5 seconds. -- Prentice From prentice at ias.edu Thu Jun 18 10:49:13 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Thu, 18 Jun 2009 13:49:13 -0400 Subject: [Beowulf] IB problem/using IB diagnostics In-Reply-To: <9f8092cc0906180850t10bf14dfob53f7fd9481a7bbc@mail.gmail.com> References: <4A3A5B4F.60006@ias.edu> <9f8092cc0906180850t10bf14dfob53f7fd9481a7bbc@mail.gmail.com> Message-ID: <4A3A7E19.5030408@ias.edu> John Hearns wrote: > Can you log into node36 and run ibstat or ibstatus? 
> Yes, I can: [root at node36 ~]# ibstat CA 'mthca0' CA type: MT25208 (MT23108 compat mode) Number of ports: 2 Firmware version: 4.8.917 Hardware version: 20 Node GUID: 0x0005ad00001ff970 System image GUID: 0x0005ad000100d050 Port 1: State: Active Physical state: LinkUp Rate: 20 Base lid: 60 LMC: 0 SM lid: 1 Capability mask: 0x02510a68 Port GUID: 0x0005ad00001ff971 Port 2: State: Down Physical state: Polling Rate: 10 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x02510a68 Port GUID: 0x0005ad00001ff972 [root at node36 ~]# ibstatus Infiniband device 'mthca0' port 1 status: default gid: fe80:0000:0000:0000:0005:ad00:001f:f971 base lid: 0x3c sm lid: 0x1 state: 4: ACTIVE phys state: 5: LinkUp rate: 20 Gb/sec (4X DDR) Infiniband device 'mthca0' port 2 status: default gid: fe80:0000:0000:0000:0005:ad00:001f:f972 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 2: Polling rate: 10 Gb/sec (4X) -- Prentice From Greg at keller.net Thu Jun 18 12:29:06 2009 From: Greg at keller.net (Greg Keller) Date: Thu, 18 Jun 2009 14:29:06 -0500 Subject: [Beowulf] IB problem/using IB diagnostics In-Reply-To: <200906181751.n5IHpT3W007304@bluewest.scyld.com> References: <200906181751.n5IHpT3W007304@bluewest.scyld.com> Message-ID: <2807D561-3DA4-47A3-B447-CA7461DD780C@Keller.net> On Jun 18, 2009, at 12:51 PM, beowulf-request at beowulf.org wrote: > Message: 4 > Date: Thu, 18 Jun 2009 11:20:47 -0400 > From: Prentice Bisbal > One of my nodes has an IB problem: > > -------------------------------------------------------------------------- > WARNING: There is at least on IB HCA found on host 'node36.aurora', > but > there is > no active ports detected. This is most certainly not what you wanted. > Check your cables and SM configuration. > -------------------------------------------------------------------------- > Does the output of vstat or ibstat show the port active and at the line rate you expect? does perfquery show packet traffic? This looks like a problem we saw when the software stack on one of our nodes was bad, any chance of "bit creep" in your environment where the installation is incomplete or are you confident it's identical to the working nodes? Cheers! Greg -------------- next part -------------- An HTML attachment was scrubbed... URL: From matt at technoronin.com Thu Jun 18 20:51:49 2009 From: matt at technoronin.com (Matt Lawrence) Date: Thu, 18 Jun 2009 22:51:49 -0500 (CDT) Subject: [Beowulf] HPC fault tolerance using virtualization In-Reply-To: <4A37ED3F.1080008@vcu.edu> References: <9f8092cc0906151059x6f38f1f3r28f78fde6b09085@mail.gmail.com> <258e18ac0906160908j17bc4fd7qa6cd566acd83526c@mail.gmail.com> <9f8092cc0906160923t7471948as746bbdb4883faa5@mail.gmail.com> <4A37ED3F.1080008@vcu.edu> Message-ID: On Tue, 16 Jun 2009, Mike Davis wrote: > In my experience, Sysadmins don't want beer or luxurious offices they want > the tools that they need, proper managerial support, and respect. And I'm sure not getting any of that in my current job. And the pay scale is much lower. -- Matt It's not what I know that counts. It's what I can remember in time to use. 
From hearnsj at googlemail.com Fri Jun 19 01:41:39 2009 From: hearnsj at googlemail.com (John Hearns) Date: Fri, 19 Jun 2009 09:41:39 +0100 Subject: [Beowulf] Talking about virtualization Message-ID: <9f8092cc0906190141v3be50ad8m22394f7af93847a6@mail.gmail.com> Virtualbox 3 beta: http://www.theregister.co.uk/2009/06/18/sun_virtualbox_3_beta_1/ I think I've already said here that I've tested Parallels Extreme, which does the same job, and it works great. From prentice at ias.edu Fri Jun 19 05:23:00 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Fri, 19 Jun 2009 08:23:00 -0400 Subject: [Beowulf] IB problem/using IB diagnostics In-Reply-To: <9f8092cc0906190124i6f8625ftaf14ec356db5f2af@mail.gmail.com> References: <4A3A5B4F.60006@ias.edu> <9f8092cc0906180850t10bf14dfob53f7fd9481a7bbc@mail.gmail.com> <4A3A7C13.2040107@ias.edu> <9f8092cc0906190124i6f8625ftaf14ec356db5f2af@mail.gmail.com> Message-ID: <4A3B8324.5000706@ias.edu> John Hearns wrote: > > > 2009/6/18 Prentice Bisbal > > > John Hearns wrote: > > Can you log into node36 and run ibstat or ibstatus? > > > > Looks good to me! > Links are up and it sees a subnet manager. As Greg says, looks like > something wonky in the script which is reporting > the node status?? It's actually an MPI job (HPL using OpenMPI) which is reporting the problem. The head scratching continues... -- Prentice From buccaneer at rocketmail.com Fri Jun 19 06:00:16 2009 From: buccaneer at rocketmail.com (Buccaneer for Hire.) Date: Fri, 19 Jun 2009 06:00:16 -0700 (PDT) Subject: [Beowulf] IB problem/using IB diagnostics Message-ID: <148299.47553.qm@web30602.mail.mud.yahoo.com> > It's actually an MPI job (HPL using OpenMPI) which is > reporting the > problem. > > The head scratching continues... > I had a similar problem earlier in year with some blades. It was pretty ugly for a while. Most of it was related to firmware on the blades and IB. From gus at ldeo.columbia.edu Fri Jun 19 09:14:52 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Fri, 19 Jun 2009 12:14:52 -0400 Subject: [Beowulf] IB problem/using IB diagnostics In-Reply-To: <4A3B8324.5000706@ias.edu> References: <4A3A5B4F.60006@ias.edu> <9f8092cc0906180850t10bf14dfob53f7fd9481a7bbc@mail.gmail.com> <4A3A7C13.2040107@ias.edu> <9f8092cc0906190124i6f8625ftaf14ec356db5f2af@mail.gmail.com> <4A3B8324.5000706@ias.edu> Message-ID: <4A3BB97C.7000207@ldeo.columbia.edu> Prentice Bisbal wrote: > John Hearns wrote: >> >> 2009/6/18 Prentice Bisbal > >> >> John Hearns wrote: >> > Can you log into node36 and run ibstat or ibstatus? >> > >> >> Looks good to me! >> Links are up and it sees a subnet manager. As Greg says, looks like >> something wonky in the script which is reporting >> the node status?? > > It's actually an MPI job (HPL using OpenMPI) which is reporting the > problem. > > The head scratching continues... > Hi Prentice, list Just in case you haven't seen this ... Are you using OpenMPI 1.3.0 or 1.3.1? Those versions have a memory leak bug when using IB. The solution for the memory leak is to upgrade to 1.3.2. A workaround is to use -mca mpi_leave_pinned=0. See: http://www.open-mpi.org/community/lists/announce/2009/04/0030.php https://svn.open-mpi.org/trac/ompi/ticket/1853 My HPL with OpenMPI 1.3.1 crashed when using lots of memory. I upgraded to 1.3.2, which fixed the problem, and I haven't looked at the error messages, so your problem may be different. However, memory leaks can produce weird errors, hard to diagnose. My $0.02. 
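In case it helps, the workaround can go either on the mpirun command line or in a per-user MCA parameter file (the process count and hostfile below are just placeholders, not your actual job):

mpirun -np 16 --hostfile myhosts --mca mpi_leave_pinned 0 ./xhpl

or put the line "mpi_leave_pinned = 0" in $HOME/.openmpi/mca-params.conf so that every run picks it up.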
Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- From prentice at ias.edu Fri Jun 19 09:48:37 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Fri, 19 Jun 2009 12:48:37 -0400 Subject: [Beowulf] IB problem/using IB diagnostics In-Reply-To: <4A3BB97C.7000207@ldeo.columbia.edu> References: <4A3A5B4F.60006@ias.edu> <9f8092cc0906180850t10bf14dfob53f7fd9481a7bbc@mail.gmail.com> <4A3A7C13.2040107@ias.edu> <9f8092cc0906190124i6f8625ftaf14ec356db5f2af@mail.gmail.com> <4A3B8324.5000706@ias.edu> <4A3BB97C.7000207@ldeo.columbia.edu> Message-ID: <4A3BC165.5030805@ias.edu> Gus Correa wrote: > Prentice Bisbal wrote: >> John Hearns wrote: >>> >>> 2009/6/18 Prentice Bisbal > >>> >>> John Hearns wrote: >>> > Can you log into node36 and run ibstat or ibstatus? >>> > >>> >>> Looks good to me! >>> Links are up and it sees a subnet manager. As Greg says, looks like >>> something wonky in the script which is reporting >>> the node status?? >> >> It's actually an MPI job (HPL using OpenMPI) which is reporting the >> problem. >> >> The head scratching continues... >> > > Hi Prentice, list > > Just in case you haven't seen this ... > Are you using OpenMPI 1.3.0 or 1.3.1? > Those versions have a memory leak bug when using IB. > The solution for the memory leak is to upgrade to 1.3.2. > A workaround is to use -mca mpi_leave_pinned=0. > See: > > http://www.open-mpi.org/community/lists/announce/2009/04/0030.php > https://svn.open-mpi.org/trac/ompi/ticket/1853 > > My HPL with OpenMPI 1.3.1 crashed when using lots of memory. > I upgraded to 1.3.2, which fixed the problem, > and I haven't looked at the error messages, > so your problem may be different. > However, memory leaks can produce weird errors, hard to diagnose. > I'm using OpenMPI 1.2.8 -- Prentice From pal at di.fct.unl.pt Fri Jun 19 10:28:20 2009 From: pal at di.fct.unl.pt (Paulo Afonso Lopes) Date: Fri, 19 Jun 2009 18:28:20 +0100 (WEST) Subject: [Beowulf] IB problem/using IB diagnostics In-Reply-To: <4A3B8324.5000706@ias.edu> References: <4A3A5B4F.60006@ias.edu> <9f8092cc0906180850t10bf14dfob53f7fd9481a7bbc@mail.gmail.com> <4A3A7C13.2040107@ias.edu> <9f8092cc0906190124i6f8625ftaf14ec356db5f2af@mail.gmail.com> <4A3B8324.5000706@ias.edu> Message-ID: <3007.10.170.133.93.1245432500.squirrel@www.di.fct.unl.pt> > > It's actually an MPI job (HPL using OpenMPI) which is reporting the > problem. > > The head scratching continues... > It seems, from the ongoing discussion , that you do not have a hw problem, but an (open)MPI one; I have seen openMPI failing because some user-level (or kernel; in my case it was user) verbs/etc. library missing. Sugestions: 1) check the job runs, with say, -mca btl ^udapl (exclude UDAPL and see if it runs) or e.g., -mca btl openib,tcp,sm,self or 2) more tediously, check that all libraries present in a non-failing node are available in the failing one... 
There is a "Getting Started with InfiniBand" page which has the names of the libraries/products that you should have loaded to have a fully functioning IB stack - it solved my problem :-) HTH paulo -- Paulo Afonso Lopes | Tel: +351- 21 294 8536 Departamento de Inform?tica | 294 8300 ext.10702 Faculdade de Ci?ncias e Tecnologia | Fax: +351- 21 294 8541 Universidade Nova de Lisboa | e-mail: pal at di.fct.unl.pt 2829-516 Caparica, PORTUGAL From tomislav.maric at gmx.com Thu Jun 18 09:59:35 2009 From: tomislav.maric at gmx.com (Tomislav Maric) Date: Thu, 18 Jun 2009 18:59:35 +0200 Subject: [Beowulf] noobs: what comes next? Message-ID: <4A3A7277.6010207@gmx.com> Hello everyone, it's the noob again. Actually, since my brother Mario joined my quest, we're in plural now. After reading loads of info from the net, we kind of figure that we could read on for the rest of our lives, and still won't learn "enough". Time to do some work. Question: the cheapest think we can afford (with me being a mech. eng. graduate student and my brother in Highschool) would be around 6ish Intel Mini ITX motherboards like this one: Intel D945GCLF2D mini-ITX Atom 2x1.6Ghz a gigabit eth. switch, some cables, and a large box. :) As the master node we would use our desktop computer. We're trying to assemble something that's really cheap (both on hardware as well as power costs) and that still can do some work. I would run some coarse grid CFD cases for my MSc on the little guy. Does this make any sense at all? We know that the proper thing would be to benchmark, and then scale, but in this noobish cheap baby case, there's nothing really to benchmark for. If we manage to get it working and it proves to be worth doing at least some coarse grid CFD runs, it would be great for a start. Best regards, Tomislav & Mario From rockwell at pa.msu.edu Thu Jun 18 11:03:15 2009 From: rockwell at pa.msu.edu (Tom Rockwell) Date: Thu, 18 Jun 2009 14:03:15 -0400 Subject: [Beowulf] Blade Networks Switches? Message-ID: <4A3A8163.30806@pa.msu.edu> Hi, Anybody using Blade Networks switches? Any comments on them? Looking at their 24 port 10GE switches. Pricing is attractive. http://www.bladenetwork.net/ Thanks, Tom Rockwell Michigan State U. From brockp at umich.edu Thu Jun 18 14:41:36 2009 From: brockp at umich.edu (Brock Palen) Date: Thu, 18 Jun 2009 17:41:36 -0400 Subject: [Beowulf] PGI_POST_COMPILE_OPTS Message-ID: I noticed that the cray XT machines are setting an environment variable 'PGI_POST_COMPILE_OPTS' to -tp barcelona-64 Is this a cray specific thing? the -tp flag is a normal pgi compiler option, I want to know if I can use this variable to set options for specific people (module files), and not make a change global in localrc. I can't find it documented any place, other than the cray module files, and I can't find its value any place looking on NICS Kraken, any input would be great to know if I can use this option or not. Brock Palen www.umich.edu/~brockp Center for Advanced Computing brockp at umich.edu (734)936-1985 From drcoolsanta at gmail.com Sat Jun 20 07:33:17 2009 From: drcoolsanta at gmail.com (Dr Cool Santa) Date: Sat, 20 Jun 2009 20:03:17 +0530 Subject: [Beowulf] Possible to perform different jobs on different nodes? Message-ID: <86b56470906200733k4601152ei9ba9da002269a22e@mail.gmail.com> Because of the software that I am using, I am forced to use MPICH1 implementation. And because the softeware is programmed in such a way, most of the time, I don't need more than 3-5 nodes while I have 8. 
I just want to know whether it is possible for it to use the first 3 nodes for the first job and the next 3 for the next, and so on. I don't think it is doing so right now, and I am not sure whether it is even possible with this specific implementation. If I must migrate to something else I am ready; I just need to know what I should know from you people. -------------- next part -------------- An HTML attachment was scrubbed... URL: From dzaletnev at yandex.ru Sat Jun 20 10:23:41 2009 From: dzaletnev at yandex.ru (Dmitry Zaletnev) Date: Sat, 20 Jun 2009 21:23:41 +0400 Subject: [Beowulf] OpenFOAM, Linux & Compilers Message-ID: <218571245518621@webmail114.yandex.ru> Dear Sir/Madam! A few days ago I tried to compile OpenFOAM. I didn't succeed in compiling paraFoam, nor did I succeed in using ParaView 3.4, downloaded from the official site, to view VTK-format data from OpenFOAM: the build could not find cmake. Earlier I worked as a software tester and then as a programmer on NT projects. But now I barely understand how compilation under *nix takes place - what are make, cmake, profiles, etc.? What is a *nix executable? I've heard that Gentoo compiles an executable each time it is to be run - and what about other Linuxes? Is there somebody who can briefly tell me about software development under *nix? And what benefits does using Sun Studio or Portland Group PGI give instead of gcc & gdb? By the way, what else do I need to compile and run my first project under *nix? I use Ubuntu 9.04 x64, openSuSE 11.1 x64, Fedora 9 x64, Fedora 11 ppc64 & Solaris 10. Sincerely, Dmitry Zaletnev From jlforrest at berkeley.edu Mon Jun 22 09:24:19 2009 From: jlforrest at berkeley.edu (Jon Forrest) Date: Mon, 22 Jun 2009 09:24:19 -0700 Subject: [Beowulf] noobs: what comes next? In-Reply-To: <4A3A7277.6010207@gmx.com> References: <4A3A7277.6010207@gmx.com> Message-ID: <4A3FB033.8060301@berkeley.edu> Tomislav Maric wrote: > After reading loads of info from the net, we kind of figure that we > could read on for the rest of our lives, and still won't learn "enough". > Time to do some work. My suggestion to you is to forget about buying anything new right now. Instead, find some cheap used P4 PCs with at least 1GB of RAM. In the US such things can easily be found for ~$150. Then, try loading Rocks (www.rocksclusters.org) and/or Perceus (www.perceus.org/portal/). This will keep you busy for a while and will help you figure out what you're doing. Expect to fail several times while you're learning. This is normal. Once you have more knowledge, and you've tried running the actual program you want to run in the future, then start thinking about the best hardware to get. Cordially, -- Jon Forrest Research Computing Support College of Chemistry 173 Tan Hall University of California Berkeley Berkeley, CA 94720-1460 510-643-1032 jlforrest at berkeley.edu From tomislav.maric at gmx.com Mon Jun 22 09:19:16 2009 From: tomislav.maric at gmx.com (Tomislav Maric) Date: Mon, 22 Jun 2009 18:19:16 +0200 Subject: [Beowulf] OpenFOAM, Linux & Compilers In-Reply-To: <218571245518621@webmail114.yandex.ru> References: <218571245518621@webmail114.yandex.ru> Message-ID: <4A3FAF04.1010904@gmx.com> Dmitry Zaletnev wrote: > Dear Sir/Madam! > A few days ago I tried to compile OpenFOAM. I didn't succeed in compiling paraFoam, nor did I succeed in using ParaView 3.4, > downloaded from the official site, to view VTK-format data from OpenFOAM: the build could not find cmake. Earlier I worked as a software tester and then as a programmer on NT projects.
But now I barely understand how compilation under *nix takes place - what are make, cmake, profiles, etc.? What is a *nix executable? I've heard that Gentoo compiles an executable each time it is to be run - and what about other Linuxes? Is there somebody who can briefly tell me about software development under *nix? And what benefits does using Sun Studio or Portland Group PGI give instead of gcc & gdb? By the way, what else do I need to compile and run my first project under *nix? I use Ubuntu 9.04 x64, openSuSE 11.1 x64, Fedora 9 x64, Fedora 11 ppc64 & Solaris 10. > > Sincerely, > Dmitry Zaletnev > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Hello Dmitry, I've been using OpenFOAM for a while now. The question is a bit misdirected, if I may say so - the OpenFOAM forum is the right place for these kinds of questions, namely: http://www.cfd-online.com/Forums/openfoam-installation/ OpenFOAM has the wmake bash scripting utility (and others, like foamNew) that enables You to build it (and your applications, template libs, etc.) without going into the details of GNU make. Here is the link for building OF the easy way: http://openfoamwiki.net/index.php/Howto_compile_OpenFOAM_the_easy_way Follow these steps and You should be ok. The -dev version is not "official", but it is better and newer and opener :))) . I'll be brave here as a noob in almost everything, and write a little side note: Linux/Unix OSes give you the option to install (copy) your binaries (machine language code) and/or headers into system directories designated for that use (google Linux FHS), but You can compile the sources and copy the binaries wherever You wish. GNU Make is a utility that manages this process for large apps: it checks what needs to be compiled and linked and what can be left as it is (that's really handy when You have 150 MB of C++ code like in OF). Anyway, my advice: 1) follow the link above 2) if problems arise, search the OF forum http://www.cfd-online.com/Forums/openfoam-installation/ 3) contact me by email if the problems survive the fight. Best regards, Tomislav From john.hearns at mclaren.com Mon Jun 22 09:21:06 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Mon, 22 Jun 2009 17:21:06 +0100 Subject: [Beowulf] noobs: what comes next? In-Reply-To: <4A3A7277.6010207@gmx.com> References: <4A3A7277.6010207@gmx.com> Message-ID: <68A57CCFD4005646957BD2D18E60667B0C463A2C@milexchmb1.mil.tagmclarengroup.com> > We're trying to assemble something that's really cheap (both on > hardware as well as power costs) and that still can do some work. I > would run some coarse grid CFD cases for my MSc on the little guy. Does > this make any sense at all? > > We know that the proper thing would be to benchmark, and then scale, > but in this noobish cheap baby case, there's nothing really to > benchmark > for. If we manage to get it working and it proves to be worth doing at > least some coarse grid CFD runs, it would be great for a start. > > It does sound good. I would guess you are looking at using OpenFOAM for the CFD solver? One thing to look at though - how much RAM does each of these boards have? You might be better off using the recipe for the cheap cluster on Doug Eadline's Clustermonkey site. http://www.clustermonkey.net//content/view/211/33/ By doing away with the cases they made a much cheaper cluster.
If you use standard-sized motherboards you might get cheaper parts to suit them, And also have spare PCI clots for (say) a second gigabit network. Pps. I hate to say this, but pragmatically with 6 of these boards you get 12 times 1.6Ghz cores. You can buy single workstations with eight cores of Nehalem at 2.6 Ghz for not a lot of money. You can still program these using MPI, and run Openfoam on them if your aim really is to get runs done For your MSc. If on the other hand, you want to learn about clusters buy the boards and let us know how it works! The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From john.hearns at mclaren.com Mon Jun 22 09:24:16 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Mon, 22 Jun 2009 17:24:16 +0100 Subject: [Beowulf] OpenFOAM, Linux & Compilers In-Reply-To: <218571245518621@webmail114.yandex.ru> References: <218571245518621@webmail114.yandex.ru> Message-ID: <68A57CCFD4005646957BD2D18E60667B0C463A36@milexchmb1.mil.tagmclarengroup.com> > -----Original Message----- > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] > On Behalf Of Dmitry Zaletnev > Sent: 20 June 2009 18:24 > To: beowulf at beowulf.org > Subject: [Beowulf] OpenFOAM, Linux & Compilers > > Dear Sir/Madam! > A few days ago I tried to compile OpenFOAM, I didn't succeed in > compiling paraFoam, neiteher I didn't succeed in using ParaView 3.4, That's a boat load of questions! An executable is the actual program file of machine instructions which you run. I'd guess Windows has the same - think of .exe files! Regarding Gentoo, it does indeed recompile system executables every time they are installed, The thinking being you get best performance when they are optimised for your exact architectire. Me, I say if you want to run a code like Openfoam or other scientific application stick to a mainstream distribution. Like Redhat Enterprise, Fedora, SuSE (Enterprise or OpenSUSE). I may be wrong, as I am not a Gentoo weenie. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From tomislav.maric at gmx.com Mon Jun 22 13:24:27 2009 From: tomislav.maric at gmx.com (Tomislav Maric) Date: Mon, 22 Jun 2009 22:24:27 +0200 Subject: [Beowulf] noobs: what comes next? In-Reply-To: <4A3FB033.8060301@berkeley.edu> References: <4A3A7277.6010207@gmx.com> <4A3FB033.8060301@berkeley.edu> Message-ID: <4A3FE87B.3090703@gmx.com> Jon Forrest wrote: > My suggestion to you is to forget about buying anything new > right now. Instead, find some cheap used P4 PCs with at least > 1GB of RAM. In the US such things can easily be found for ~$150. I've thought about collecting old PCs from my friends, but they are all different, and I would like to follow the general rule of newbish clustering: use homogeneous nodes. Second problem is that I can get miniITX motherboard for 89$. That's the optimal price. I'm worried about the power consumption of regular pc's, and the power needed for miniITX is really low (90 wats for the motherboard and CPU included). The third issue is that I'm not just using this cluster for learning. 
At the end, it would be amazing to be able to run about 500 000 control volume CFD cases used so far (ship hydrodynamics simulations - movement of the ship excluded) for about 6 days worth of 20 seconds physical time on a cluster that's easily moved and uses regular PC's PSU (cheap on electricity). For scaling info: I've ran 175 000 FV cells on my dual core laptop (HP compaq dual core 1.73 Ghz) for 6 days to get 4 seconds physical time. Does this sum up to a cluster being about 14 times faster machine than my laptop? I would be satisfied with much less. :) Thank You very much for Your advice, best regards, Tomislav From deadline at eadline.org Mon Jun 22 14:39:46 2009 From: deadline at eadline.org (Douglas Eadline) Date: Mon, 22 Jun 2009 17:39:46 -0400 (EDT) Subject: [Beowulf] noobs: what comes next? In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0C463A2C@milexchmb1.mil.tagmclarengro up.com> References: <4A3A7277.6010207@gmx.com> <68A57CCFD4005646957BD2D18E60667B0C463A2C@milexchmb1.mil.tagmclarengroup.com> Message-ID: <59167.192.168.1.213.1245706786.squirrel@mail.eadline.org> If you want to build a small cheap usable cluster check this out. http://limulus.basement-supercomputing.com/ The project is moving a little slow, but that is because I have been working on packaging everything (16 cores using quads) in one case with one power supply. (more on the project page) Some software announcements will be forthcoming as well. Actually, I think of it as more of a parallel workstation than a production cluster. -- Doug > > >> We're trying to assemble something that's really cheap (both on >> hardware as well as power costs) and that still can do some work. I >> would run some coarse grid CFD cases for my MSc on the little guy. > Does >> this make any sense at all? >> >> We know that the proper thing would be to benchmark, and then > scale, >> but in this noobish cheap baby case, there's nothing really to >> benchmark >> for. If we manage to get it working and it proves to be worth doing at >> least some coarse grid CFD runs, it would be great for a start. >> >> > It does sound good. > > I would guess you are looking are looking at using OpenFOAM for the CFD > solver? > One thing to look at though - how much RAM does each of these board > have? > > > You might be better using the recipe for the cheap cluster on Doug > Eadlines Clustermonkey site. > http://www.clustermonkey.net//content/view/211/33/ > By doing away with the cases they made a much cheaper cluster. > If you use standard-sized motherboards you might get cheaper parts to > suit them, > And also have spare PCI clots for (say) a second gigabit network. > > > > Pps. I hate to say this, but pragmatically with 6 of these boards you > get 12 times 1.6Ghz cores. > You can buy single workstations with eight cores of Nehalem at 2.6 Ghz > for not a lot of money. > You can still program these using MPI, and run Openfoam on them if your > aim really is to get runs done > For your MSc. > If on the other hand, you want to learn about clusters buy the boards > and let us know how it works! > > > > The contents of this email are confidential and for the exclusive use of > the intended recipient. If you receive this email in error you should not > copy it, retransmit it, use it or disclose its contents but should return > it to the sender immediately and delete your copy. 
> > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Doug From james.p.lux at jpl.nasa.gov Mon Jun 22 15:39:19 2009 From: james.p.lux at jpl.nasa.gov (Lux, James P) Date: Mon, 22 Jun 2009 15:39:19 -0700 Subject: [Beowulf] noobs: what comes next? In-Reply-To: <59167.192.168.1.213.1245706786.squirrel@mail.eadline.org> References: <4A3A7277.6010207@gmx.com> <68A57CCFD4005646957BD2D18E60667B0C463A2C@milexchmb1.mil.tagmclarengroup.com> <59167.192.168.1.213.1245706786.squirrel@mail.eadline.org> Message-ID: > -----Original Message----- > From: beowulf-bounces at beowulf.org > [mailto:beowulf-bounces at beowulf.org] On Behalf Of Douglas Eadline > Sent: Monday, June 22, 2009 2:40 PM > To: Hearns, John > Cc: beowulf at beowulf.org > Subject: RE: [Beowulf] noobs: what comes next? > > > If you want to build a small cheap usable cluster check this out. > > http://limulus.basement-supercomputing.com/ > > The project is moving a little slow, but that is because I > have been working on packaging everything (16 cores using > quads) in one case with one power supply. > (more on the project page) > > Some software announcements will be forthcoming as well. > > Actually, I think of it as more of a parallel workstation > than a production cluster. > > -- > Doug > > Very interesting. Similar to the speculative(never implemented) "can you build a cluster from parts bought only from Wal-Mart?" http://home.earthlink.net/~jimlux/beowulf/walmart.htm As far as Jeff wanting it to fit under an airplane seat. Over the years, I've given some thought to that, and of course, some folks have done it (clusters in a toolbox/lunchbox etc.. Usually using Via MiniITX mobos of one sort or another.. They're out there on the web). Clusters, it's true, but realistically, peformance that is not as good as a single laptop with a much faster processor. I think, though, that what Jeff wants is some significant improvement in computational horsepower over, say, a laptop. I'd propose that you want, say, 10 times the crunch of a laptop of comparable generation(otherwise, it's not worth doing.. If you picked a factor of 2-4, you could also just wait a year, and the laptop would have the power, at much less complexity). The 100W idle, 1kW full bore is a good place to be power wise. Heck, I just got one of those instant espresso making machines for Fathers Day, and it draws 1200W, so clearly that's a reasonable short term power draw for a "home appliance". And, as you've outlined in Limulus, an architectural approach that lends itself to scaling or rebuilding with new hardware as it becomes available would be nice. Personally, I'd love it if it ran basically headless, and I used a regular small notebook/laptop computer as the "control/user interface" I'm going to be carrying the notebook anyway, so might as well use it. Wireless networking over short distances would work fine. Hmm, can my Macbook Air be a DHCP server so it can boot with PXE? I suppose so. Sheetmetal work is going to be the key. The networking, hardware, and software is pretty rack and stack of commodity stuff, for the most part (granted, it's non trivial to make a turnkey system..) James Lux, P.E. 
Task Manager, SOMD Software Defined Radios Flight Communications Systems Section Jet Propulsion Laboratory 4800 Oak Grove Drive, Mail Stop 161-213 Pasadena, CA, 91109 +1(818)354-2075 phone +1(818)393-6875 fax > From tomislav.maric at gmx.com Mon Jun 22 15:51:42 2009 From: tomislav.maric at gmx.com (Tomislav Maric) Date: Tue, 23 Jun 2009 00:51:42 +0200 Subject: [Beowulf] noobs: what comes next? Message-ID: <4A400AFE.6030900@gmx.com> I've sent this mail as a direct reply to Mr. John Hearns, so I'm sending it again to the list - my apologies to Mr. Hearns. Hearns, John wrote: > > > > I would guess you are looking are looking at using OpenFOAM for the CFD > > solver? Of course, what else is there? :)) > > One thing to look at though - how much RAM does each of these board > > have? > > 2GB 240-pin DIMM, unbuffered Non-ECC memory > > > > You might be better using the recipe for the cheap cluster on Doug > > Eadlines Clustermonkey site. > > http://www.clustermonkey.net//content/view/211/33/ > > By doing away with the cases they made a much cheaper cluster. I've read both articles, for Microwulf and LunchBox. This is really a great way to go, but I've done some numbers: AMD Athlon 64 X2 2,66 GHz and ASUS mATX motherboard with GLAN and DDR2 RAM socket cost together about 170 USD. and my original mboard with Atom CPU and 2GB DDR2 RAM costs about 120 USD. They are slower (1.66 GHz), but I'm worried about the power consumption. The prices for miniITX are from here: http://www.mini-box.com/Intel-D945GCLF2D-Mini-ITX-Motherboard If I buy a 100 USD Gigabit switch, 3 HDDs (10000 RPM and about 160 GB) and 10 of these motherboards, it would sum up to 1600 USD. I'm just a huble noob. Please note that I'm just trying to find out what's best for me, with no experience in price/performance ratios of current COTS hardware or detailed knowledge of benchmarking. > > If you use standard-sized motherboards you might get cheaper parts to > > suit them, > > And also have spare PCI clots for (say) a second gigabit network. > > > > This might be a problem, but I can use my desktop comp for a master node. I only have to buy additional Gig. Eth. NIC. That was our original idea. > > > > Pps. I hate to say this, but pragmatically with 6 of these boards you > > get 12 times 1.6Ghz cores. > > You can buy single workstations with eight cores of Nehalem at 2.6 Ghz > > for not a lot of money. The cheapest i7 i could find in Croatia is Intel Core i7, 920, Socket 1366, 2.66GHz QuadCore and costs alone 400$. To get the approx. configuration to the one above with miniITX, I would have to by 2 of them on 2 motherboards costing about 300 USD each and forcing me to pay additional 70 USD for DDR3 RAM. This comes out about 1540 USD for this config with 4 GB RAM, and miniITX combo has 10 GB. This is without the disks, the switch, two power supplies and kablovinje (that's Croatian slang for cables :) . I'm also worried about the coarse grained nature of CFD simulations if I go down this road - the IPC is really low with respect to RAM access and data manipulation (again, I might be wrong, please correct me). Having 4 cores accessing 2GB of RAM all the time... can this be a problem? > > You can still program these using MPI, and run Openfoam on them if your > > aim really is to get runs done > > For your MSc. > > If on the other hand, you want to learn about clusters buy the boards > > and let us know how it works! 
> > > > I think I'll try and go with miniITX, this choice seems to be a blend of coolness (I get to learn clustering) and it's cheap enough for a graduate student from Croatia and his Highschool brother. :)) Thank You very much for the advice, best regards, Tomislav From hahn at mcmaster.ca Mon Jun 22 17:16:06 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon, 22 Jun 2009 20:16:06 -0400 (EDT) Subject: [Beowulf] noobs: what comes next? In-Reply-To: <4A3FE87B.3090703@gmx.com> References: <4A3A7277.6010207@gmx.com> <4A3FB033.8060301@berkeley.edu> <4A3FE87B.3090703@gmx.com> Message-ID: > different, and I would like to follow the general rule of newbish > clustering: use homogeneous nodes. I wonder where that rule came from. homogeneity is certainly more convenient, and might actually make sense for certain domains (where some amount of global synchrony is required.) but for a starter cluster, heck no! for CFD, you simply decompose your work to match the speed of your nodes - you don't need to assume each node is equally fast, or that each piece of work takes the same effort... > Second problem is that I can get miniITX motherboard for 89$. That's the there's nothing wrong with the form-factor - the problem is that atom processors are not built for speed. > optimal price. I'm worried about the power consumption of regular pc's, > and the power needed for miniITX is really low (90 wats for the > motherboard and CPU included). that would be a strange miniitx/atom system if it dissipated that much - I think my media server is more like 30W full-on. the atom approach, while amusing, starts out with so many disadvantages. it's slow, with a small cache and slow memory system. OK, you could still do OK if you use enough of them. but "enough" will mean extra infrastructure per node (gpu/northbridge, random MB devices, all of which dissipate more than the CPU). not to mention the fact that no task that is meaningfully "parallel" scales perfectly: with lots of slow processors, you really want to take an approach like sicortex (rip?) who provided a very nice network as well as great packaging. > The third issue is that I'm not just using this cluster for learning. At > the end, it would be amazing to be able to run about 500 000 control > volume CFD cases used so far (ship hydrodynamics simulations - movement > of the ship excluded) for about 6 days worth of 20 seconds physical time > on a cluster that's easily moved and uses regular PC's PSU (cheap on > electricity). I think you're making some strange assumptions about speed:power ratios. a 65W quad-core mainstream processor will be a joy to use compared to a heap of atoms, even if you completely ignore the overhead of the support electronics. for PSUs, you simply want to do two things: size it so you're running at ~85% capacity, and get a plus80 model (gold, etc). > For scaling info: I've ran 175 000 FV cells on my dual core laptop > (HP compaq dual core 1.73 Ghz) for 6 days to get 4 seconds physical > time. Does this sum up to a cluster being about 14 times faster machine > than my laptop? I would be satisfied with much less. :) scaling is always a struggle: you want fairly significant cpus, unless you can specifically provide low-overhead infrastructure (support chips, networking - basically "custom"). From hahn at mcmaster.ca Mon Jun 22 17:26:38 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon, 22 Jun 2009 20:26:38 -0400 (EDT) Subject: [Beowulf] noobs: what comes next? 
In-Reply-To: <4A400AFE.6030900@gmx.com> References: <4A400AFE.6030900@gmx.com> Message-ID: > If I buy a 100 USD Gigabit switch, 3 HDDs (10000 RPM and about 160 GB) so a $100 Gb switch should get you ~12 ports. why in Amdahl's name would you waste your money on 10k rpm disks, especially when in your previous breath you were talking about power? maybe I should have said Moore's name, since disk technology, being an areal process like IC density, is scaling exponentially. that means that modern disks already provide more bandwidth than you can use. there's a good argument for high-rpm disks if you have an inherently seeky workload, but I really don't think CFD is. > what's best for me, with no experience in price/performance ratios of > current COTS hardware or detailed knowledge of benchmarking. considered as bare chips, atom's flops/power is probably not bad. once you factor in the support chips... core2 or k10-generation systems will be fast, cool and cheap. From alscheinine at tuffmail.us Mon Jun 22 20:42:59 2009 From: alscheinine at tuffmail.us (Alan Louis Scheinine) Date: Mon, 22 Jun 2009 22:42:59 -0500 Subject: [Beowulf] PGI_POST_COMPILE_OPTS In-Reply-To: References: <4A402079.6060107@tuffmail.us> Message-ID: <4A404F43.8030507@tuffmail.us> Brock Palen, In your original posting you spoke of setting options using PGI_POST_COMPILE_OPTS without specifying the type of option. In your follow-up post you wrote > using PGI's unified binary, to support all of these > -tp x64,amd64e,barcelona-64 Interesting. A PGI WWW page says: > GI compilers can generate a single PGI Unified Binary? executable > fully optimized for both Intel EM64T and AMD64 processors, delivering > all the benefits of a single x64 platform while enabling you to leverage > the latest innovations from both Intel and AMD. For different AMD revisions also in the same binary? Reading the manual, it seems that there are no restrictions on what can be in the "-tp" list. The environment variable PGI_POST_COMPILE_OPTS seems like the appropriate place to set the "-tp" options. Tell use whether the Cray digests this innovative use. Alan -- Alan Scheinine 200 Georgann Dr., Apt. E6 Vicksburg, MS 39180 Email: alscheinine at tuffmail.us Mobile phone: 225 288 4176 From tomislav.maric at gmx.com Tue Jun 23 09:23:31 2009 From: tomislav.maric at gmx.com (Tomislav Maric) Date: Tue, 23 Jun 2009 18:23:31 +0200 Subject: [Beowulf] noobs: what comes next? In-Reply-To: References: <4A400AFE.6030900@gmx.com> Message-ID: <4A410183.9000405@gmx.com> Thank You for the detailed answer, it's really educational. I've been reading Atom's description today on tom's hardware, and there it's also described in the way You have described it. I was thinking about using RAID because I was worried about backup space and speed for data transfer on the hard disk since there's a lot of data to be stored - about 1GB for a case with 175 000 finite volumes case. Now that I'm more informed I think I'll listen to You and Mr. John Hearns and go with standard issue CPUs and mboards and, besides that, whatever forgotten attic/basement hardware I can get my hands on from my friends. Well, as I've said - I don't really know much, yet. I can only hope for patience with my newbish questions/statements, like the one regarding the power consumption for the miniITX - if it's PSU can give out 90 Wats, that doesn't mean it will be spent. 
:) Thanks again, Tomislav From gus at ldeo.columbia.edu Tue Jun 23 12:03:05 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Tue, 23 Jun 2009 15:03:05 -0400 Subject: [Beowulf] PGI_POST_COMPILE_OPTS In-Reply-To: <4A404F43.8030507@tuffmail.us> References: <4A402079.6060107@tuffmail.us> <4A404F43.8030507@tuffmail.us> Message-ID: <4A4126E9.1060904@ldeo.columbia.edu> Hi Brock, Alan, list First off ... Brock: thanks for the RCE Podcast. It is great! For those who don't know it, and are interested in all aspects of HPC, here is the link: http://www.rce-cast.com/ Brock Palen wrote: > I noticed that the cray XT machines are setting an environment variable > 'PGI_POST_COMPILE_OPTS' to -tp barcelona-64 > > Is this a cray specific thing? It seems to be a Cray thing indeed. There are equivalent flags for GNU and PATHSCALE on Cray also. See these links: http://www.nccs.gov/computing-resources/jaguar/software/?&software=libsci http://www.nics.tennessee.edu/user-support/software?&software=libsci See also pages 65-67 of this Cray document: http://docs.cray.com/books/S-2393-13/S-2393-13.pdf I would guess the Cray "libsci" somehow adds those flags on the fly when you launch the PGI compilers (or alias the pgi compiler commands). On our (non-Cray) Linux cluster with AMD quad-core I have to include those optimization flags by hand. In our case, for AMD Shanghai it is -tp shanghai-64. Likewise for Intel machines (with different -tp flags, of course). I have checked a number of PGI files and there is no such a thing as a PGI_POST_COMPILE_OPTS environment variable. For instance, their suggestions for "module" files set these environment variables: PGI, CC, CPP, CXX, FC, F77, F90, PATH, MANPATH, and LD_LIBRARY_PATH. However, no machine-dependent optimization flag is set, no PGI_POST_COMPILE_OPTS either. (The default machine-independent optimization level, according to man pgf90, is -01.) You can check these module.skel files in: pgi/8.0-4/linux86-64/8.0-4/etc/modulefiles/ (whatever version you have mine is 8.0-4). Note also that the PGI directory tree has a specific "cray" directory, with several libraries, which suggests that PGI does some special voodoo on Cray machines. Intel has the ifort.cfg, icc.cfg, icpc.cfg files that allow the compilers to be configured/customized with user/system options. I couldn't find an equivalent scheme on PGI. My $0.02. Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- > the -tp flag is a normal pgi compiler > option, I want to know if I can use this variable to set options for > specific people (module files), and not make a change global in localrc. > > I can't find it documented any place, other than the cray module files, > and I can't find its value any place looking on NICS Kraken, any input > would be great to know if I can use this option or not. > > > Brock Palen > www.umich.edu/~brockp > Center for Advanced Computing > brockp at umich.edu > (734)936-1985 > > > Alan Louis Scheinine wrote: > Brock Palen, > > In your original posting you spoke of setting options using > PGI_POST_COMPILE_OPTS > without specifying the type of option. In your follow-up post you wrote > >> using PGI's unified binary, to support all of these >> -tp x64,amd64e,barcelona-64 > > Interesting. A PGI WWW page says: > PGI compilers can generate a single PGI Unified Binary? 
executable > > fully optimized for both Intel EM64T and AMD64 processors, delivering > > all the benefits of a single x64 platform while enabling you to leverage > > the latest innovations from both Intel and AMD. > > For different AMD revisions also in the same binary? Reading the > manual, it seems > that there are no restrictions on what can be in the "-tp" list. > > The environment variable PGI_POST_COMPILE_OPTS seems like the appropriate > place to set the "-tp" options. Tell use whether the Cray digests this > innovative use. > > Alan > From gdjacobs at gmail.com Tue Jun 23 16:24:33 2009 From: gdjacobs at gmail.com (Geoff Jacobs) Date: Tue, 23 Jun 2009 18:24:33 -0500 Subject: [Beowulf] noobs: what comes next? In-Reply-To: <4A410183.9000405@gmx.com> References: <4A400AFE.6030900@gmx.com> <4A410183.9000405@gmx.com> Message-ID: <4A416431.80703@gmail.com> Tomislav Maric wrote: > Thank You for the detailed answer, it's really educational. I've been > reading Atom's description today on tom's hardware, and there it's also > described in the way You have described it. > > I was thinking about using RAID because I was worried about backup space > and speed for data transfer on the hard disk since there's a lot of data > to be stored - about 1GB for a case with 175 000 finite volumes case. > Now that I'm more informed I think I'll listen to You and Mr. John > Hearns and go with standard issue CPUs and mboards and, besides that, > whatever forgotten attic/basement hardware I can get my hands on from my > friends. Transfer speed gets a speed bump with 10k Velociraptors, but it's not that huge a bump. You're better off buying cheap WD Caviar Greens and spending the difference on RAM. Please note, RAID is not a backup method, it's to increase the resiliency of the filesystem. -- Geoffrey D. Jacobs From gdjacobs at gmail.com Tue Jun 23 16:28:53 2009 From: gdjacobs at gmail.com (Geoff Jacobs) Date: Tue, 23 Jun 2009 18:28:53 -0500 Subject: [Beowulf] Possible to perform different jobs on different nodes? In-Reply-To: <86b56470906200733k4601152ei9ba9da002269a22e@mail.gmail.com> References: <86b56470906200733k4601152ei9ba9da002269a22e@mail.gmail.com> Message-ID: <4A416535.3060300@gmail.com> Dr Cool Santa wrote: > Because of the software that I am using, I am forced to use MPICH1 > implementation. > And because the softeware is programmed in such a way, most of the time, > I don't need more than 3-5 nodes while I have 8. I just want to know > whether it is possible for it to use the first 3 nodes for the first job > and the next 3 for the next and so on. > I think it is not doing so right now and whether it would be possible > with this specific implementation is unsure. > If I must migrate to something else I am ready, I just need to know what > I should know from you people. Just specify a machinefile when you do mpirun. Have the first set in the first machinefile, run with that, and run a second job with a second machinefile. -- Geoffrey D. Jacobs From amjad11 at gmail.com Tue Jun 23 21:23:39 2009 From: amjad11 at gmail.com (amjad ali) Date: Wed, 24 Jun 2009 09:23:39 +0500 Subject: [Beowulf] Parallel Programming Question Message-ID: <428810f20906232123x4ba721aye1f4c64edec741b0@mail.gmail.com> Hello all, In an mpi parallel code which of the following two is a better way: 1) Read the input data from input data files only by the master process and then broadcast it other processes. 2) All the processes read the input data directly from input data files (no need of broadcast from the master process). 
Is it possible?. Thank you very much. Regards, Amjad Ali. -------------- next part -------------- An HTML attachment was scrubbed... URL: From brockp at umich.edu Mon Jun 22 20:05:51 2009 From: brockp at umich.edu (Brock Palen) Date: Mon, 22 Jun 2009 23:05:51 -0400 Subject: [Beowulf] PGI_POST_COMPILE_OPTS In-Reply-To: <4A402079.6060107@tuffmail.us> References: <4A402079.6060107@tuffmail.us> Message-ID: On Jun 22, 2009, at 8:23 PM, Alan Louis Scheinine wrote: On the Cray systems I have access to (Kraken) this variable is set to -tp barcelona-64 Which just tells the compiler which CPU to optimize for. I would like to use it on our system where we have a mix of AMD revisions, using PGI's unified binary, to support all of these -tp x64,amd64e,barcelona-64 Now I am sure many will argue against setting a value like this by default on a system, but its much better solution than dealing one at a time what SIGILL means. Brock Palen > I do not know where and how PGI_POST_COMPILE_OPTS is used, > nonetheless, why would you want to set this option rather than > CFLAGS and LDFLAGS in those cases where an application uses those > environment variables? The Cray compilation procedure adds a great > deal to the compile or link command so you would need to look at > how it is used in the actual compilation command, in particular, > using the option "-v". Changing PGI_POST_COMPILE_OPTS and looking > for your change in the compilation line revealed with -v might > give you confidence about how it is used in the Cray compilation > and linking but I don't think Cray expects the user to change > this variable -- unless he or she is cross-compiling. > > Alan > > -- > > Alan Scheinine > 200 Georgann Dr., Apt. E6 > Vicksburg, MS 39180 > > Email: alscheinine at tuffmail.us > Mobile phone: 225 288 4176 > > From john.hearns at mclaren.com Mon Jun 22 23:37:19 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Tue, 23 Jun 2009 07:37:19 +0100 Subject: [Beowulf] noobs: what comes next? In-Reply-To: <4A400AFE.6030900@gmx.com> References: <4A400AFE.6030900@gmx.com> Message-ID: <68A57CCFD4005646957BD2D18E60667B0C463AE7@milexchmb1.mil.tagmclarengroup.com> > > I've sent this mail as a direct reply to Mr. John Hearns, so I'm > sending > it again to the list - my apologies to Mr. Hearns. Emmm... John please! > Hearns, John wrote: > > > > > > I would guess you are looking are looking at using OpenFOAM for the > CFD > > > solver? > > Of course, what else is there? :)) You can download a trial version of CD-Adapco software, But I Agree for the budget point of 0 there's not much choice! > I'm also worried about the coarse grained nature of CFD simulations if > I > go down this road - the IPC is really low with respect to RAM access > and > data manipulation (again, I might be wrong, please correct me). Having > 4 > cores accessing 2GB of RAM all the time... can this be a problem? You're going to need more than 2 or 4 Gbytes memory for any real CFD usage. Seriously, these days people are speccing 4gigs for office workstations. Regarding memory bandwidth, note that I said Nehalem - which are getting good benchmark results For CFD codes. IF I were you I would seriously be considering my budget and looking at a as good a Nehalem Workstation as I can afford. Sorry, I know this goes against the list's origins in making good use of cheap or surplus hardware, And I hang mu head in shame. I also hang my head in shame at using Outlook which Capitalises Every Line. 
The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From brockp at umich.edu Tue Jun 23 19:03:35 2009 From: brockp at umich.edu (Brock Palen) Date: Tue, 23 Jun 2009 22:03:35 -0400 Subject: [Beowulf] PGI_POST_COMPILE_OPTS In-Reply-To: <4A4126E9.1060904@ldeo.columbia.edu> References: <4A402079.6060107@tuffmail.us> <4A404F43.8030507@tuffmail.us> <4A4126E9.1060904@ldeo.columbia.edu> Message-ID: <1A57C436-35B0-4B0D-8805-ADB6BB683E64@umich.edu> Yeah you appear right, did some small tests, does not work on a normal machine, damn, just regular documentation (does it ever get read by grad students?). yeah looks like we dug up all the same stuff. Glad you enjoy the show! FYI, Jeff and I are skipping a show, I am out all this week in DC at TeraGrid 09, which for a show that has ~150 consistent listeners, no one has called me out and the guy from the show. More audience to reach I guess :-) Thanks for the input! Brock Palen www.umich.edu/~brockp Center for Advanced Computing brockp at umich.edu (734)936-1985 On Jun 23, 2009, at 3:03 PM, Gus Correa wrote: > Hi Brock, Alan, list > > First off ... > > Brock: thanks for the RCE Podcast. > It is great! > For those who don't know it, > and are interested in all aspects of HPC, > here is the link: > > http://www.rce-cast.com/ > > Brock Palen wrote: > > I noticed that the cray XT machines are setting an environment > variable > > 'PGI_POST_COMPILE_OPTS' to -tp barcelona-64 > > > > Is this a cray specific thing? > > It seems to be a Cray thing indeed. > There are equivalent flags for GNU and PATHSCALE on Cray also. > See these links: > > http://www.nccs.gov/computing-resources/jaguar/software/? > &software=libsci > http://www.nics.tennessee.edu/user-support/software?&software=libsci > > See also pages 65-67 of this Cray document: > > http://docs.cray.com/books/S-2393-13/S-2393-13.pdf > > I would guess the Cray "libsci" somehow adds those flags on the fly > when you launch the PGI compilers (or alias the pgi compiler > commands). > > On our (non-Cray) Linux cluster with > AMD quad-core I have to include those optimization flags by hand. > In our case, for AMD Shanghai it is -tp shanghai-64. > Likewise for Intel machines (with different -tp flags, of course). > > I have checked a number of PGI files and there is no such a thing > as a PGI_POST_COMPILE_OPTS environment variable. > For instance, their suggestions for "module" files set these > environment > variables: > PGI, CC, CPP, CXX, FC, F77, F90, PATH, MANPATH, and LD_LIBRARY_PATH. > However, no machine-dependent optimization flag is set, > no PGI_POST_COMPILE_OPTS either. > (The default machine-independent optimization level, > according to man pgf90, is -01.) > > You can check these module.skel files in: > pgi/8.0-4/linux86-64/8.0-4/etc/modulefiles/ (whatever version you have > mine is 8.0-4). > > Note also that the PGI directory tree has a specific "cray" > directory, with several libraries, which suggests that PGI does some > special voodoo on Cray machines. > > Intel has the ifort.cfg, icc.cfg, icpc.cfg files that allow the > compilers to be configured/customized with user/system options. > I couldn't find an equivalent scheme on PGI. > > My $0.02. 
> Gus Correa > > --------------------------------------------------------------------- > Gustavo Correa > Lamont-Doherty Earth Observatory - Columbia University > Palisades, NY, 10964-8000 - USA > --------------------------------------------------------------------- > > > the -tp flag is a normal pgi compiler > > option, I want to know if I can use this variable to set options for > > specific people (module files), and not make a change global in > localrc. > > > > I can't find it documented any place, other than the cray module > files, > > and I can't find its value any place looking on NICS Kraken, any > input > > would be great to know if I can use this option or not. > > > > > > Brock Palen > > www.umich.edu/~brockp > > Center for Advanced Computing > > brockp at umich.edu > > (734)936-1985 > > > > > > > > Alan Louis Scheinine wrote: >> Brock Palen, >> In your original posting you spoke of setting options using >> PGI_POST_COMPILE_OPTS >> without specifying the type of option. In your follow-up post you >> wrote >>> using PGI's unified binary, to support all of these >>> -tp x64,amd64e,barcelona-64 >> Interesting. A PGI WWW page says: >> PGI compilers can generate a single PGI Unified Binary? executable >> > fully optimized for both Intel EM64T and AMD64 processors, >> delivering >> > all the benefits of a single x64 platform while enabling you to >> leverage >> > the latest innovations from both Intel and AMD. >> For different AMD revisions also in the same binary? Reading the >> manual, it seems >> that there are no restrictions on what can be in the "-tp" list. >> The environment variable PGI_POST_COMPILE_OPTS seems like the >> appropriate >> place to set the "-tp" options. Tell use whether the Cray digests >> this >> innovative use. >> Alan > > > From mm at yuhu.biz Wed Jun 24 00:22:03 2009 From: mm at yuhu.biz (Marian Marinov) Date: Wed, 24 Jun 2009 10:22:03 +0300 Subject: [Beowulf] Erlang Usage Message-ID: <200906241022.03284.mm@yuhu.biz> Hello, I'm currently learning Erlang and I'm curious have any of you guys have ever used Erlang on their clusters? Have anyone experimented in doing any academic work with it? -- Best regards, Marian Marinov From tegner at nada.kth.se Wed Jun 24 00:36:45 2009 From: tegner at nada.kth.se (Jon Tegner) Date: Wed, 24 Jun 2009 09:36:45 +0200 Subject: [Beowulf] noobs: what comes next? In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0C463AE7@milexchmb1.mil.tagmclarengroup.com> References: <4A400AFE.6030900@gmx.com> <68A57CCFD4005646957BD2D18E60667B0C463AE7@milexchmb1.mil.tagmclarengroup.com> Message-ID: <4A41D78D.2080302@nada.kth.se> >> Hearns, John wrote: >> >>>> I would guess you are looking are looking at using OpenFOAM for There is also overture https://computation.llnl.gov/casc/Overture/ using overlapping grids. Complete with gridgenerator and a bunch of solvers. Excellent software! /jon From eugen at leitl.org Wed Jun 24 01:31:43 2009 From: eugen at leitl.org (Eugen Leitl) Date: Wed, 24 Jun 2009 10:31:43 +0200 Subject: [Beowulf] noobs: what comes next? In-Reply-To: <4A416431.80703@gmail.com> References: <4A400AFE.6030900@gmx.com> <4A410183.9000405@gmx.com> <4A416431.80703@gmail.com> Message-ID: <20090624083143.GX23524@leitl.org> On Tue, Jun 23, 2009 at 06:24:33PM -0500, Geoff Jacobs wrote: > Transfer speed gets a speed bump with 10k Velociraptors, but it's not Velociraptors are mostly for high-IOPS applications, they're not that fast. 
http://www.storagereview.net/php/benchmark/suite_v4.php?typeID=10&testbedID=4&osID=6&raidconfigID=1&numDrives=1&devID_0=321&devID_1=358&devID_2=360&devID_3=354&devCnt=4 > that huge a bump. You're better off buying cheap WD Caviar Greens and > spending the difference on RAM. Please note, RAID is not a backup > method, it's to increase the resiliency of the filesystem. -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE From tjrc at sanger.ac.uk Wed Jun 24 01:33:42 2009 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Wed, 24 Jun 2009 09:33:42 +0100 Subject: [Beowulf] Erlang Usage In-Reply-To: <200906241022.03284.mm@yuhu.biz> References: <200906241022.03284.mm@yuhu.biz> Message-ID: <717792A6-AE5B-4C72-9203-F0AE297B850A@sanger.ac.uk> On 24 Jun 2009, at 8:22 am, Marian Marinov wrote: > Hello, > I'm currently learning Erlang and I'm curious have any of you guys > have ever > used Erlang on their clusters? > > Have anyone experimented in doing any academic work with it? Only indirectly - my only encounter with it is people using couchdb, which is implemented in it. Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From tomislav.maric at gmx.com Wed Jun 24 03:02:40 2009 From: tomislav.maric at gmx.com (Tomislav Maric) Date: Wed, 24 Jun 2009 12:02:40 +0200 Subject: [Beowulf] noobs: what comes next? In-Reply-To: <4A41D78D.2080302@nada.kth.se> References: <4A400AFE.6030900@gmx.com> <68A57CCFD4005646957BD2D18E60667B0C463AE7@milexchmb1.mil.tagmclarengroup.com> <4A41D78D.2080302@nada.kth.se> Message-ID: <4A41F9C0.7000607@gmx.com> Jon Tegner wrote: > > using overlapping grids. Complete with gridgenerator and a bunch of > solvers. Excellent software! > > /jon > I'll certainly take a look at it, but from what I've read on the main page, it uses structured and curvilinear grids, while OF also supports polyhedral finite volumes and unstructured grids that are better suited for meshing complex geometries. I'm still a graduate student, but that doesn't stop me from being amazed by its capabilities. If You are interested, take a look at: http://powerlab.fsb.hr/ped/kturbo/OpenFOAM/ there's tons of documentation (PhDs, articles from scientific journals, movies, etc.) in that repo, and if You wish, I can send You the release notes for the -dev version - it's worth taking a look at, because of the numerous solvers and utilities listed in one place. Best regards, Tomislav From deadline at eadline.org Wed Jun 24 06:46:45 2009 From: deadline at eadline.org (Douglas Eadline) Date: Wed, 24 Jun 2009 09:46:45 -0400 (EDT) Subject: [Beowulf] Erlang Usage In-Reply-To: <200906241022.03284.mm@yuhu.biz> References: <200906241022.03284.mm@yuhu.biz> Message-ID: <58963.192.168.1.213.1245851205.squirrel@mail.eadline.org> I have played with Erlang quite a bit (not built any real programs however) I like Erlang for several reasons - functional - open and freely available - production level implementation (Ericsson) You can write Erlang programs that run on one core, multiple cores, or multiple servers. Erlang is based on concurrent processes and messages. It also has the capability to "hot swap" code while programs are running. 
As far as HPC, Erlang is interpreted although it can be "compiled" and it does not directly support the idea of an array. Plus if you are not familiar with functional languages, it may seem a bit weird. If you want to check it out, you can download a version at erlang.org (it is often part of many distributions). Plus have pick up a copy of Joe Armstrong's "Programming Erlang: Software for a Concurrent World" -- Doug > Hello, > I'm currently learning Erlang and I'm curious have any of you guys have > ever > used Erlang on their clusters? > > Have anyone experimented in doing any academic work with it? > > -- > Best regards, > Marian Marinov > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Doug From mm at yuhu.biz Wed Jun 24 07:39:10 2009 From: mm at yuhu.biz (Marian Marinov) Date: Wed, 24 Jun 2009 17:39:10 +0300 Subject: [Beowulf] Erlang Usage In-Reply-To: <58963.192.168.1.213.1245851205.squirrel@mail.eadline.org> References: <200906241022.03284.mm@yuhu.biz> <58963.192.168.1.213.1245851205.squirrel@mail.eadline.org> Message-ID: <200906241739.11023.mm@yuhu.biz> On Wednesday 24 June 2009 16:46:45 Douglas Eadline wrote: > I have played with Erlang quite a bit (not built any > real programs however) I like Erlang for several reasons > > - functional > - open and freely available > - production level implementation (Ericsson) > > You can write Erlang programs that run on one core, > multiple cores, or multiple servers. Erlang is based on > concurrent processes and messages. It also has the > capability to "hot swap" code while programs are running. > > As far as HPC, Erlang is interpreted although it can be "compiled" > and it does not directly support the idea of an array. Plus > if you are not familiar with functional languages, it may seem > a bit weird. > > If you want to check it out, you can download a version > at erlang.org (it is often part of many distributions). > Plus have pick up a copy of Joe Armstrong's "Programming > Erlang: Software for a Concurrent World" > > -- > Doug > I'm already writing some programs in Erlang(also no real work done). I'm already familiar with most of it and also read Joe Amstrong's book :) It is really strange but in a good way :) Combining ideas of different programming languages. It is fun to write in it. But still I don't have any real application for it, this is why I asked if anyone has done any real work with it. Regards, Marian From hahn at mcmaster.ca Wed Jun 24 08:44:03 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed, 24 Jun 2009 11:44:03 -0400 (EDT) Subject: [Beowulf] Parallel Programming Question In-Reply-To: <428810f20906232123x4ba721aye1f4c64edec741b0@mail.gmail.com> References: <428810f20906232123x4ba721aye1f4c64edec741b0@mail.gmail.com> Message-ID: > In an mpi parallel code which of the following two is a better way: > > 1) Read the input data from input data files only by the master process > and then broadcast it other processes. > > 2) All the processes read the input data directly from input data files > (no need of broadcast from the master process). Is it possible?. 2 is certainly possible; whether it's any advantage depends too much on your filesystem, size of data, etc. I'd expect 2 to be faster only if your file setup is peculiar - for instance, if you can expect all nodes to have the input files cached already. 
otherwise, with a FS like NFS, 2 will lose, since MPI broadcast is almost certainly more time-efficient than N nodes all fetching the file separately. but you should ask whether the data involved is large, and whether each rank actually needs it. if each rank needs only a different subset of data, then reading separately could easily be faster. From gus at ldeo.columbia.edu Wed Jun 24 12:21:14 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Wed, 24 Jun 2009 15:21:14 -0400 Subject: [Beowulf] Parallel Programming Question In-Reply-To: References: <428810f20906232123x4ba721aye1f4c64edec741b0@mail.gmail.com> Message-ID: <4A427CAA.9050304@ldeo.columbia.edu> Hi Amjad, list Mark Hahn said it all: 2 is possible, but only works efficiently under certain conditions. In my experience here, with ocean, atmosphere, and climate models, I've seen parallel programs with both styles of I/O (not only input!). Here is what I can tell about them. #1 is the traditional way, the "master" processor reads the parameter (e.g. Fort ran namelists) and data file(s), broadcasts parameters that are used by all "slave" processors, and scatters any data that will be processed in a distributed fashion by each "slave" processor. E.g., the seawater density may be a parameter used by all processors, whereas the incoming solar radiation on a specific part of the planet, is only used by the particular processor that is handling that specific area of the world. For output it is about the same procedure in the reverse sense, i.e., the "master" processor gathers the data from all "slave" processors, and writes the output file(s). That always works, there is no file system contention. In case the data is too big for the master node memory capacity, one can always split the I/O in smaller chunks. One drawback of this mode is that it halt the computation while the "master" is doing I/O. There is also some communication cost associated to broadcasting, scattering, or gathering data. Another drawback is that you need to write more code for the I/O procedure. In my experience this cost is minimal, and this mode pays off. Normally funneling I/O through the "master" processor takes less time than the delays caused by the contention that would be generated by many processors trying to access the same files, say, on a single NFS file server. (If you have a super-duper parallel file system then this may be fine.) In addition, MPI is in control of everything, you are less dependent on NFS quirks. I really prefer this mode #1. However, I must say that the type of program we run here doesn't do I/O very often, typically once every 1000 time steps (order of magnitude), with lots of computation in-between. For I/O intensive programs you may need strategy #2. #2 is used by a number of programs we run. Not that they really need to do it, but because it takes less coding, I suppose. They always cause a problem when the number of processors is big. Most of these programs take a cavalier approach to I/O, which does not take into account the actual filesystem in use (whether local disk, NFS, or parallel FS), the size of the files, the number of processors in use, etc. I.e. they tend to ignore all the points that Mark mentioned as important. Often times these codes were developed on big iron machines, ignoring the hurdles one has to face on a Beowulf. In general they don't use MPI parallel I/O either (or other MPI-I/O developments such as HDF-5 file parallel I/O, or NetCDF-4 file parallel I/O). 
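
To make the two styles concrete, a minimal sketch of mode #1 in C with MPI might look like the following. The file name, the counts, and the assumption that the cells divide evenly among the ranks are made-up illustrations, not anything from a real model:

/* Mode #1 sketch: the "master" reads, broadcasts parameters, scatters data.
   "input.dat", NPARAM, NCELLS and the even split are illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define NPARAM 4          /* run-wide parameters, e.g. seawater density */
#define NCELLS 1000000    /* total finite-volume cells (made up)        */

int main(int argc, char **argv)
{
    int rank, nprocs, nlocal, i;
    double params[NPARAM];
    double *global = NULL, *local;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    nlocal = NCELLS / nprocs;              /* assumes an even split */
    local  = malloc(nlocal * sizeof(double));

    if (rank == 0) {                       /* only rank 0 touches the file */
        FILE *fp = fopen("input.dat", "r");
        if (!fp) MPI_Abort(MPI_COMM_WORLD, 1);
        global = malloc(NCELLS * sizeof(double));
        for (i = 0; i < NPARAM; i++) fscanf(fp, "%lf", &params[i]);
        for (i = 0; i < NCELLS; i++) fscanf(fp, "%lf", &global[i]);
        fclose(fp);
    }

    /* every rank gets the parameters, each rank gets only its slice */
    MPI_Bcast(params, NPARAM, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(global, nlocal, MPI_DOUBLE,
                local,  nlocal, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* ... compute on local[] ...
       output mirrors the input: MPI_Gather back to rank 0, which writes */

    free(local);
    if (rank == 0) free(global);
    MPI_Finalize();
    return 0;
}

The Fortran bindings (MPI_BCAST, MPI_SCATTER) follow exactly the same pattern, so nothing above is C-specific.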
These programs may improve in the future, but the current situation is this: brute force "parallel" I/O. (The right name would be something else, not "parallel I?O", perhaps "contentious I/O", "stampede I/O", "looting I/O", or other.) I have to run these programs on a restricted number of processors, to avoid NFS to collapse. Somewhere around 32-64 processors real problems start to happen, and much earlier than that if your I/O and MPI networks are the same. You can do some tricks, like increasing the number of NFS daemons, changing the buffer size, etc, but there is a limit to how far the tricks can go. Parallel file systems presumably handle this situation more smoothly (but they will cost more than a single NFS server). However, this second approach should work well if you stage in and stage out all parameter files and data to/from **local disks** on each node (although with dual-socket quad-core systems you may still have 8 processors reading the same files at the same time). This is typically done by the script you submit to the resource manager/queue system. The script stages in the input files, then launches the parallel program (mpirun), and at the end stages out the output files. However, in the field I work, "stage-in/stage-out" data files has mostly been phased out and replaced by files on NFS mounted directories. (I read here and elsewhere that I/O intensive parallel programs - genome research, computational biology, maybe computational chemistry - continue to use "stage-in/stage-out" precisely to avoid contention over NFS, and to avoid paying more for a parallel file system.) So, for low I/O-to-computation ratio, I suggest using #1. For high I/O-to-computation ratio, maybe #2 using local disks and stage-in/stage-out (if you want to keep cost low). As usual, YMMV. :) I hope this helps. Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- Mark Hahn wrote: >> In an mpi parallel code which of the following two is a better way: >> >> 1) Read the input data from input data files only by the master >> process >> and then broadcast it other processes. >> >> 2) All the processes read the input data directly from input data >> files >> (no need of broadcast from the master process). Is it possible?. > > 2 is certainly possible; whether it's any advantage depends too much > on your filesystem, size of data, etc. I'd expect 2 to be faster only > if your file setup is peculiar - for instance, if you can expect all > nodes to have the input files cached already. otherwise, with a FS like > NFS, 2 will lose, since MPI broadcast is almost certainly more > time-efficient than N nodes all fetching the file separately. > > but you should ask whether the data involved is large, and whether each > rank actually needs it. if each rank needs only a different subset of > data, then reading separately could easily be faster. 
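
When "reading separately" is done in a disciplined way it usually means MPI-IO rather than every rank fopen()ing the same file. A minimal sketch of that variant, again with a made-up file name, a raw array-of-doubles layout, and an even split of the data all assumed:

/* "Each rank reads its own piece" via MPI-IO instead of N independent opens.
   "input.bin", the raw double layout and the even split are assumptions. */
#include <stdlib.h>
#include <mpi.h>

#define NCELLS 1000000

int main(int argc, char **argv)
{
    int rank, nprocs, nlocal;
    double *local;
    MPI_File fh;
    MPI_Offset offset;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    nlocal = NCELLS / nprocs;                       /* even split assumed */
    offset = (MPI_Offset) rank * nlocal * sizeof(double);
    local  = malloc(nlocal * sizeof(double));

    MPI_File_open(MPI_COMM_WORLD, "input.bin",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    /* collective read: the MPI library (and a parallel FS, if there is
       one) can coalesce the requests instead of N ranks hammering NFS */
    MPI_File_read_at_all(fh, offset, local, nlocal, MPI_DOUBLE,
                         MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    /* ... compute on local[] ... */

    free(local);
    MPI_Finalize();
    return 0;
}

Whether this beats funneling everything through rank 0 depends entirely on the filesystem underneath, for all the reasons discussed above.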
> _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From eugen at leitl.org Thu Jun 25 05:41:52 2009 From: eugen at leitl.org (Eugen Leitl) Date: Thu, 25 Jun 2009 14:41:52 +0200 Subject: [Beowulf] Water-cooled IBM supercomputer to heat buildings Message-ID: <20090625124152.GJ23524@leitl.org> http://news.cnet.com/8301-11386_3-10272069-76.html June 24, 2009 12:07 PM PDT Water-cooled IBM supercomputer to heat buildings by Manek Dubash IBM and the Swiss Federal Institute of Technology at Zurich plan to build a water-cooled supercomputer whose surplus heat will be re-used to heat the university's buildings. The Aquasar supercomputer will be located at the ETH Zurich facility, and it will start operations next year, the partners said in an announcement on Tuesday. Water flows along copper pipes in a blade server used in the Aquasar supercomputer. (Credit: IBM) The supercomputer will combine two rack-mounted IBM BladeCenter servers, each containing multiple blades with a mixed population of IBM PowerXCell 8i and Intel Nehalem processors. It is expected to deliver a peak performance of about 10 teraflops. The installation will re-use heat directly for in-building heating. IBM estimates that the wate-rcooling scheme will reduce the system's carbon footprint by up to 85 percent and save up to 30 tons of carbon dioxide annually, compared with standard cooling approaches. The comparison calculations are based on average yearly operation of the system and on in-building heating energy being produced by fossil fuels, the company said. The energy-consuming refrigeration units used by almost every data center consume about half of the a data center's energy. Aquasar will need no such equipment. As a result, it should reduce overall energy consumption by 40 percent, according to IBM. "Energy is arguably the number-one challenge humanity will be facing in the 21st century. We cannot afford anymore to design computer systems based on the criterion of computational speed and performance alone," Professor Poulikakos of ETH Zurich, the leader of the Aquasar project, said in a statement. "The new target must be high-performance and low-net power consumption supercomputers and data centers. This means liquid cooling." The system is the product of an extended joint research project between ETH and IBM scientists, focused on chip-level water-cooling. It also encompasses a concept for "water-cooled data centers with direct energy re-use" proposed by scientists at IBM's Zurich Lab. Aquasar's use of warm water rather than cold water for cooling is unique and IBM-patented, a spokesman for the company said. Water, which is about 4,000 times more efficient as a coolant than air, will enter the system at 60 degrees C. This will keep the chips in the system at operating temperatures below their maximum of 85 degrees C, according to IBM. The high input temperature of the coolant results in an even higher-grade heat as an output, which in this case will be about 65 degrees C, the company said. The system uses jet impingement cooling, which means that water makes direct contact with the back of the chip via micro-channels in the heat sink, according to research papers by the IBM and ETH scientists involved in the Aquasar project. 
"This method incurs neither the thermal resistance overhead of a base plate, nor the overhead and reliability problem of thermal interface materials, and thus is promising for removing highest-power densities," according to one paper. Pipelines from the individual blades link to the server rack's water-pipe network, which in turn is connected to the main water transportation network. Aquasar will need about 10 liters of water for cooling, pumped at some 30 liters per minute, IBM said. The cooling system is a closed circuit: the water is heated by the chips and cooled to the required temperature as it passes through a passive heat exchanger, delivering the removed heat directly to the heating system of the university. Aquasar will be used by the computer science department at ETH Zurich for multiscale flow simulations related to nanotechnology and fluid dynamics. Researchers plan to show that solving scientific problems efficiently can be performed in an energy-efficient manner. Manek Dubash of ZDNet UK reported from London. From rpnabar at gmail.com Thu Jun 25 11:09:19 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 25 Jun 2009 13:09:19 -0500 Subject: [Beowulf] dedupe filesystem In-Reply-To: <1243964380.30944.8.camel@localhost.localdomain> References: <1243964380.30944.8.camel@localhost.localdomain> Message-ID: On Tue, Jun 2, 2009 at 12:39 PM, Ashley Pittman wrote: > Fdupes scans the filesystem looking for files where the size matches, if > it does it md5's them checking for matches and if that matches it > finally does a byte-by-byte compare to be 100% sure. > Why is a full byte-by-byte comparison needed even after a md5 sum matches? I know there is a vulnerability in md5 but that's more of a security thing and by random chance super unlikely , right? Or, why not use another checksum that is as yet not vulnerable? SHA1? SHA2? etc.? Or are they way too expensive to compute? Just curious.... -- Rahul -------------- next part -------------- An HTML attachment was scrubbed... URL: From stewart at serissa.com Thu Jun 25 11:44:57 2009 From: stewart at serissa.com (Lawrence Stewart) Date: Thu, 25 Jun 2009 14:44:57 -0400 Subject: [Beowulf] dedupe filesystem In-Reply-To: References: <1243964380.30944.8.camel@localhost.localdomain> Message-ID: <4A43C5A9.60808@serissa.com> Rahul Nabar wrote: > > > On Tue, Jun 2, 2009 at 12:39 PM, Ashley Pittman > wrote: > > Fdupes scans the filesystem looking for files where the size > matches, if > it does it md5's them checking for matches and if that matches it > finally does a byte-by-byte compare to be 100% sure. > > > Why is a full byte-by-byte comparison needed even after a md5 sum > matches? I know there is a vulnerability in md5 but that's more of a > security thing and by random chance super unlikely , right? > > Or, why not use another checksum that is as yet not vulnerable? SHA1? > SHA2? etc.? Or are they way too expensive to compute? > > Just curious.... > > -- > Rahul > ------------------------------------------------------------------------ > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > It is a tradeoff between the size of the hash and the amount of storage it takes. 
Due to the "birthday problem," when you use an N bit hash, the probability of an erroneous hash collision rises to about 1/2 when the number of blocks in the file system rises to 2**(N/2). If you have 64 bit hashes and 1024 byte blocks, for example, this happens with about 4 TB of data, which is not much. You could choose to use 128 bit hashes, and cut down this problem by another factor of 2**32, but it may work out cheaper to do the real compare and be sure, one time out of 2**32 than to double the storage for hashes. If you going to depend on hash collisions meaning the data is identical, then you really need 256 bit hashes, probably, because you want the probability of an erroneous collision to be much smaller than other sources of data loss, which ought to be down in the 2**-45 area. (1e-15 ish). The security properties of the hashes are not so important, just their randomness. -Larry From amjad11 at gmail.com Fri Jun 26 06:50:19 2009 From: amjad11 at gmail.com (amjad ali) Date: Fri, 26 Jun 2009 18:50:19 +0500 Subject: [Beowulf] f77 in f90 Message-ID: <428810f20906260650u274e4504mbf72722f10299375@mail.gmail.com> Hi, all, I am parallelizing a serial-fortran77 code having .f source files. I am planning to save the fortran source file as .f90 instead of .f files. Then I would use the mpif90 compiler. The fortran77 constructs in the code are understood by mpif90 compiler (i feel so). Will there be any performance loss (or gain) in compiling the fortran77 code with mpif90 (where the code will be containing some fortran90 constructs as well)? Any opinion? An advantage with mpif90 is that I can use ALLOCATABLE arrays for some basic array variables; saving memory. THANKS A LOT FOR YOUR KIND ATTENTION. With best regards, Amjad Ali. -------------- next part -------------- An HTML attachment was scrubbed... URL: From prentice at ias.edu Fri Jun 26 07:09:03 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Fri, 26 Jun 2009 10:09:03 -0400 Subject: [Beowulf] f77 in f90 In-Reply-To: <428810f20906260650u274e4504mbf72722f10299375@mail.gmail.com> References: <428810f20906260650u274e4504mbf72722f10299375@mail.gmail.com> Message-ID: <4A44D67F.6070803@ias.edu> amjad ali wrote: > Hi, all, > > I am parallelizing a serial-fortran77 code having .f source files. I am > planning to save the fortran source file as .f90 instead of .f files. > Then I would use the mpif90 compiler. The fortran77 constructs in the > code are understood by mpif90 compiler (i feel so). You're feelings are correct. F90 is a superset of F77, so any correct F77 code is also correct F90 code. > > Will there be any performance loss (or gain) in compiling the fortran77 > code with mpif90 (where the code will be containing some fortran90 > constructs as well)? Any opinion? I doubt it. Most modern Fortran compilers do F77 and F90 (ifort and gfortran for example), so the same optimization logic is being used. > > An advantage with mpif90 is that I can use ALLOCATABLE arrays for some > basic array variables; saving memory. -- Prentice From rpnabar at gmail.com Fri Jun 26 09:14:55 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Fri, 26 Jun 2009 11:14:55 -0500 Subject: [Beowulf] dedupe filesystem In-Reply-To: References: <9f8092cc0906021018n523a7081r744e0847691300fe@mail.gmail.com> Message-ID: On Tue, Jun 2, 2009 at 11:56 PM, Matt Lawrence wrote: > I have found a great deaal of duplication in install trees, particularly > when you just want to install the latest. 
?I've managed to get some massive > savings with NIM on AIX and some lesser but stll very good savings with > CentOS by building parallel trees and hard linking the files. I'm jumping on this discussion late but I found a great number (but not so much in size I suspect) of dupes when I ran fdupes on my drives now. Some in quite unexpected places. eg. we have job submission shell wrappers for PBS/torque. The number of dupes indicates that often people are not removing these files at all after the job finishes. So, more than an automated-dedup solution fdupes has helped me know what are the filesystem cleanup opportunities I have! The next time someone comes running for a disk quota increase I know what to check! ;-) -- Rahul From dzaletnev at yandex.ru Fri Jun 26 09:36:04 2009 From: dzaletnev at yandex.ru (Dmitry Zaletnev) Date: Fri, 26 Jun 2009 20:36:04 +0400 Subject: [Beowulf] Cluster Networking Message-ID: <60361246034164@webmail18.yandex.ru> Hi, I have two questions: 1. Is there any influence on performance of a NFS-server from the usage of x32 CPU and OS instead of x64, if all other characteristics of the system, i.e. amount of RAM, soft-SATA-II RAID 0, Realtek GLAN NIC are the same? 2. Is the Etherchannel technology (channel bonding) useful for CFD-application running on a cluster of two Dual-Xeon servers? Sincerely, Dmitry Zaletnev From john.hearns at mclaren.com Fri Jun 26 09:36:54 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Fri, 26 Jun 2009 17:36:54 +0100 Subject: [Beowulf] One rack petaflop machine Message-ID: <68A57CCFD4005646957BD2D18E60667B0C4FDC9E@milexchmb1.mil.tagmclarengroup.com> Gentlemen, ladies, the challenge has been thrown down: http://www.theregister.co.uk/2009/06/26/darpa_supercomputer_slim_size/ Now, is that theoretical peaks petaflops, or HPL petaflops ;-) The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From carsten.aulbert at aei.mpg.de Fri Jun 26 10:56:09 2009 From: carsten.aulbert at aei.mpg.de (Carsten Aulbert) Date: Fri, 26 Jun 2009 19:56:09 +0200 Subject: [Beowulf] Cluster Networking In-Reply-To: <60361246034164@webmail18.yandex.ru> References: <60361246034164@webmail18.yandex.ru> Message-ID: <4A450BB9.4000802@aei.mpg.de> Hi Dmitry Zaletnev wrote: > 1. Is there any influence on performance of a NFS-server from the usage of x32 CPU and OS instead of x64, if all other characteristics of the system, i.e. amount of RAM, soft-SATA-II RAID 0, Realtek GLAN NIC are the same? Not too much I would think. Except maybe the amount of memory a single process can access or the OS can use as buffer cache. It can be quite a difference if a file server can hold 3 GB or 32 GB of files in memory if these files are accessed quite often. But I would tend to think that it won't matter too much. > > 2. Is the Etherchannel technology (channel bonding) useful for CFD-application running on a cluster of two Dual-Xeon servers? I don't know, if the traffic between both nodes is limited by the bandwidth it will help, but if it's just latency it might be bad to use channel bonding as this usually takes up quite a bit of CPU power and I have no idea about the impact on latency. 
HTH Carsten From skylar at cs.earlham.edu Fri Jun 26 11:03:01 2009 From: skylar at cs.earlham.edu (Skylar Thompson) Date: Fri, 26 Jun 2009 11:03:01 -0700 Subject: [Beowulf] Cluster Networking In-Reply-To: <60361246034164@webmail18.yandex.ru> References: <60361246034164@webmail18.yandex.ru> Message-ID: <4A450D55.2080205@cs.earlham.edu> Dmitry Zaletnev wrote: > Hi, I have two questions: > > 1. Is there any influence on performance of a NFS-server from the usage of x32 CPU and OS instead of x64, if all other characteristics of the system, i.e. amount of RAM, soft-SATA-II RAID 0, Realtek GLAN NIC are the same? > If you run 32-bit, make sure you have PAE enabled if you have more than 4GB of RAM. You'll still get some overhead when accessing more than 4GB of cache, since the kernel will have to swap out page tables for each 4GB chunk of physical RAM. A 64-bit kernel would be able to address all the RAM you have without having to resort to page table trickery. > 2. Is the Etherchannel technology (channel bonding) useful for CFD-application running on a cluster of two Dual-Xeon servers? > There's a variety of different ways to do channel bonding. I would read up on and experiemnt with all of them and choose one that ensures that you actually balance across all the NICs in your systems. It sounds like you have a small cluster, which means that using the MAC address for the hash is likely going to cause both your cluster systems to hash to the same NIC, so you won't get any balancing at all. On Linux, you can also use a combination of MAC and IP, or IP and port, which will likely be better at balancing the load. -- -- Skylar Thompson (skylar at cs.earlham.edu) -- http://www.cs.earlham.edu/~skylar/ -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 252 bytes Desc: OpenPGP digital signature URL: From laytonjb at att.net Fri Jun 26 11:30:06 2009 From: laytonjb at att.net (Jeff Layton) Date: Fri, 26 Jun 2009 14:30:06 -0400 Subject: [Beowulf] Cluster Networking In-Reply-To: <60361246034164@webmail18.yandex.ru> References: <60361246034164@webmail18.yandex.ru> Message-ID: <4A4513AE.3090502@att.net> Dmitry Zaletnev wrote: > Hi, I have two questions: > > 1. Is there any influence on performance of a NFS-server from the usage of x32 CPU and OS instead of x64, if all other characteristics of the system, i.e. amount of RAM, soft-SATA-II RAID 0, Realtek GLAN NIC are the same? > > 2. Is the Etherchannel technology (channel bonding) useful for CFD-application running on a cluster of two Dual-Xeon servers? > CFD codes are usually latency sensitive. So channel bonding doesn't buy you anything and may actually hurt (but just a little). Try something like OpenMX over GigE. Much better latencies and should perform and scale better. Jeff From rpnabar at gmail.com Fri Jun 26 11:45:34 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Fri, 26 Jun 2009 13:45:34 -0500 Subject: [Beowulf] Cluster Networking In-Reply-To: <4A450BB9.4000802@aei.mpg.de> References: <60361246034164@webmail18.yandex.ru> <4A450BB9.4000802@aei.mpg.de> Message-ID: On Fri, Jun 26, 2009 at 12:56 PM, Carsten Aulbert wrote: > I don't know, if the traffic between both nodes is limited by the > bandwidth it will help, but if it's just latency it might be bad to use > channel bonding as this usually takes up quite a bit of CPU power and I > have no idea about the impact on latency. I believe channel bonding boosts bandwidth but does not affect latency. 
I have channel bonded my twin eth cards and did not see any degradation in latency. -- Rahul From lindahl at pbm.com Fri Jun 26 11:50:13 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Fri, 26 Jun 2009 11:50:13 -0700 Subject: [Beowulf] Cluster Networking In-Reply-To: <4A4513AE.3090502@att.net> References: <60361246034164@webmail18.yandex.ru> <4A4513AE.3090502@att.net> Message-ID: <20090626185013.GA5871@bx9.net> On Fri, Jun 26, 2009 at 02:30:06PM -0400, Jeff Layton wrote: > CFD codes are usually latency sensitive. CFD codes come in many shapes and sizes, so generalizing about them is not a good idea. Really. -- greg From rpnabar at gmail.com Fri Jun 26 11:52:22 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Fri, 26 Jun 2009 13:52:22 -0500 Subject: [Beowulf] Cluster Networking In-Reply-To: <4A4513AE.3090502@att.net> References: <60361246034164@webmail18.yandex.ru> <4A4513AE.3090502@att.net> Message-ID: On Fri, Jun 26, 2009 at 1:30 PM, Jeff Layton wrote: > Try something like OpenMX over GigE. Much better latencies > and should perform and scale better. Ah! Thanks for that lead Jeff. I did not know about OpenMX. How close does it get to native Myrinet performance? Or Infiniband. OpenMX might be a great way for our cluster too to achieve better performance without changing our eth backbone. -- Rahul From rpnabar at gmail.com Fri Jun 26 12:08:08 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Fri, 26 Jun 2009 14:08:08 -0500 Subject: [Beowulf] typical latencies for gigabit ethernet Message-ID: What are typical latencies I ought to be seeing with gigabit eth? Just curious. Want to see how bad (or good) my setup is. I haven't really tuned my eth cards. Setups is a simple: server<-->switch<-->server. These are AMD Opteron-Barcelona 2.2 Ghz procs with quad cores two sockets. -- Rahul From laytonjb at att.net Fri Jun 26 12:51:00 2009 From: laytonjb at att.net (Jeff Layton) Date: Fri, 26 Jun 2009 15:51:00 -0400 Subject: [Beowulf] Cluster Networking In-Reply-To: <20090626185013.GA5871@bx9.net> References: <60361246034164@webmail18.yandex.ru> <4A4513AE.3090502@att.net> <20090626185013.GA5871@bx9.net> Message-ID: <4A4526A4.6020100@att.net> Greg Lindahl wrote: > On Fri, Jun 26, 2009 at 02:30:06PM -0400, Jeff Layton wrote: > > >> CFD codes are usually latency sensitive. >> > > CFD codes come in many shapes and sizes, so generalizing about them is > not a good idea. Really. > I definitely agree. That's why I said "usually" :) Jeff From richard.walsh at comcast.net Fri Jun 26 13:42:01 2009 From: richard.walsh at comcast.net (richard.walsh at comcast.net) Date: Fri, 26 Jun 2009 20:42:01 +0000 (UTC) Subject: [Beowulf] Cluster Networking In-Reply-To: <4A4526A4.6020100@att.net> Message-ID: <720246720.155781246048921937.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> ----- Original Message ----- From: "Jeff Layton" To: "Greg Lindahl" >> >> CFD codes come in many shapes and sizes, so generalizing about them is >> not a good idea. Really. >> >I definitely agree. That's why I said "usually" :) Sources of unpredictable (non-pipeline-able) latency in CFD include turbulent flows that move areas requiring denser meshes around your data partition space, chemistry and mixing that require the conditional invocation different kernel physics, particle tracking, and complicated fluid surface interactions among other things. 
Of course, the latency we are talking about here is a multi-tiered phenomenon involving non-node-local memory references, local memory indexed or non-strided reference patterns, and the companion cache-related complications. rbw _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: From bill at cse.ucdavis.edu Fri Jun 26 23:30:20 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Fri, 26 Jun 2009 23:30:20 -0700 Subject: [Beowulf] Parallel Programming Question In-Reply-To: <428810f20906232123x4ba721aye1f4c64edec741b0@mail.gmail.com> References: <428810f20906232123x4ba721aye1f4c64edec741b0@mail.gmail.com> Message-ID: <4A45BC7C.6020908@cse.ucdavis.edu> amjad ali wrote: > Hello all, > > In an mpi parallel code which of the following two is a better way: > > 1) Read the input data from input data files only by the master process > and then broadcast it other processes. > > 2) All the processes read the input data directly from input data files > (no need of broadcast from the master process). Is it possible?. Both.... it depends on the details of course. How big are the input files? Does each client need them all, or just their fraction? If the clients read from the input files are they local to the clients or being read from a shared file system? What does your network look like? Keep in mind that when you say broadcast that many (not all) MPI implementations do not do a true network layer broadcast... and that in most situations network uplinks are distinct from the downlinks (except for the ACKs). If all clients need all input files you can achieve good performance by either using a bit torrent approach (send 1/N of the file to each of N clients then have them re-share it), or even just a simple chain. Head -> node A -> node B -> node C. This works better than you might think since Node A can start uploading immediately and the upload bandwidth doesn't compete with the download bandwidth (well not much usually). For the typical case a MPI broadcast of 1GB because 8 nodes need 128MB wouldn't be worth it. Instead just send 128MB to each client with MPI_Send. In general I see a higher percentage of peak bandwidth with MPI than I do with NFS, but NFS can be tuned to be a reasonably high fraction of wirespeed as well. Keep in mind that it's not hard to become disk limited on the head node, you might want to take a look at how you are reading the files and the bandwidth available before you go optimizing the network layer. From hahn at mcmaster.ca Sat Jun 27 15:21:50 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Sat, 27 Jun 2009 18:21:50 -0400 (EDT) Subject: [Beowulf] typical latencies for gigabit ethernet In-Reply-To: References: Message-ID: > What are typical latencies I ought to be seeing with gigabit eth? Just seems to be fairly variable. let's say 50 +- 20 microseconds. > Setups is a simple: server<-->switch<-->server. it may be instructive to try a server-server test case. 
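
A server-to-server test of that sort is only a few lines of MPI. The sketch below (the 1-byte payload and repetition count are arbitrary choices) reports the usual half-round-trip figure when run with one rank on each server:

/* Minimal ping-pong latency test: run with one rank on each of two servers,
   e.g. via mpirun -np 2 with a two-host machinefile. */
#include <stdio.h>
#include <mpi.h>

#define WARMUP 100
#define NREPS  10000

static void pingpong(int rank, int reps)
{
    char byte = 0;
    MPI_Status st;
    int i;
    for (i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
}

int main(int argc, char **argv)
{
    int rank;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    pingpong(rank, WARMUP);              /* let connections get set up */
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    pingpong(rank, NREPS);               /* timed iterations           */
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency ~ %.1f us\n",
               (t1 - t0) * 1e6 / NREPS / 2.0);

    MPI_Finalize();
    return 0;
}

Running it once through the switch and once with the two servers cabled back to back shows how much of the number the switch itself contributes.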
From d.love at liverpool.ac.uk Sun Jun 28 04:17:50 2009 From: d.love at liverpool.ac.uk (Dave Love) Date: Sun, 28 Jun 2009 12:17:50 +0100 Subject: [Beowulf] Re: HPC fault tolerance using virtualization References: <9f8092cc0906151059x6f38f1f3r28f78fde6b09085@mail.gmail.com> <200906161038.56577.kilian.cavalotti.work@gmail.com> <9f8092cc0906160202m274ad417pa9f3da905c17799d@mail.gmail.com> Message-ID: <87vdmg38v5.fsf@liv.ac.uk> John Hearns writes: > However, you could look of correctable ECC errors, On the systems with which I'm familiar, they either won't show up in the IPMI SEL or will apparently be inconsistent with the kernel mcelog -- mcelog typically displays many more events. (I don't know why this is, though I'm overly familiar with memory errors.) > and for disks run a smartctl test and see if a disk is showing > symtopms which might make it fail in future. What I typically see from smartd is alerts when one or more sectors has already gone bad, although that tends not to be something that will clobber the running job. How should it be configured to do better (without noise)? From d.love at liverpool.ac.uk Sun Jun 28 04:26:47 2009 From: d.love at liverpool.ac.uk (Dave Love) Date: Sun, 28 Jun 2009 12:26:47 +0100 Subject: [Beowulf] Re: Erlang Usage References: <200906241022.03284.mm@yuhu.biz> <58963.192.168.1.213.1245851205.squirrel@mail.eadline.org> <200906241739.11023.mm@yuhu.biz> Message-ID: <87ocs838g8.fsf@liv.ac.uk> Marian Marinov writes: > It is really strange but in a good way :) Combining ideas of different > programming languages. I'm not sure what that refers to, but I think the strangest thing about it is the Prolog-y syntax. At least the Actor-like model should be familiar from MPI, though Erlang stresses `massive' lightweight concurrency which probably isn't relevant to most HPC. It may be a mistake to concentrate just on the language, though, rather than the implementation and the OTP framework. > It is fun to write in it. But still I don't > have any real application for it, this is why I asked if anyone has > done any real work with it. If you must find a problem for this solution, Erlang/OTP would presumably be a good implementation basis for distributed management and monitoring tools if you need more tools. For actual HPC computational work, look for "High-performance Technical Computing with Erlang", which I think was in the proceedings of last year's Erlang user group meeting, but I'm not entirely convinced by that. Note that as far as I know, Erlang/OTP doesn't support Infiniband or MX. From d.love at liverpool.ac.uk Sun Jun 28 04:30:36 2009 From: d.love at liverpool.ac.uk (Dave Love) Date: Sun, 28 Jun 2009 12:30:36 +0100 Subject: [Beowulf] Re: Erlang Usage References: <200906241022.03284.mm@yuhu.biz> <58963.192.168.1.213.1245851205.squirrel@mail.eadline.org> Message-ID: <87hby0389v.fsf@liv.ac.uk> "Douglas Eadline" writes: > As far as HPC, Erlang is interpreted although it can be "compiled" Surely it's normally compiled, whether to byte code or not. Although you don't want to be doing serious numerical work in it directly, I guess there's nothing fundamental to prevent it being compiled as well as Lisp for numerics. [Maybe Termite, which has a similar model, can do better as it's based on one of the best Scheme compilers.] > and it does not directly support the idea of an array. Well, there is a (functional) array module. 
From d.love at liverpool.ac.uk Sun Jun 28 04:37:50 2009 From: d.love at liverpool.ac.uk (Dave Love) Date: Sun, 28 Jun 2009 12:37:50 +0100 Subject: [Beowulf] Re: f77 in f90 References: <428810f20906260650u274e4504mbf72722f10299375@mail.gmail.com> <4A44D67F.6070803@ias.edu> Message-ID: <87ab3s37xt.fsf@liv.ac.uk> Prentice Bisbal writes: >> Will there be any performance loss (or gain) in compiling the fortran77 >> code with mpif90 (where the code will be containing some fortran90 >> constructs as well)? Any opinion? > > I doubt it. Most modern Fortran compilers do F77 and F90 (ifort and > gfortran for example), so the same optimization logic is being used. I've used one fairly recently where the f77 and f90 front ends were different, but I can't remember which. I'm not sure if it's what the OP means, but in principle you win with f90 constructs, like array operations, because the compiler has more scope with greater abstraction. However, you may actually do better with explicit loops, for instance, depending on the particular code and compiler. There were benchmark data on that from some ago, but doubtless things have improved since. From d.love at liverpool.ac.uk Sun Jun 28 04:45:09 2009 From: d.love at liverpool.ac.uk (Dave Love) Date: Sun, 28 Jun 2009 12:45:09 +0100 Subject: [Beowulf] Re: Cluster Networking References: <60361246034164@webmail18.yandex.ru> <4A4513AE.3090502@att.net> Message-ID: <873a9k37lm.fsf@liv.ac.uk> Rahul Nabar writes: > On Fri, Jun 26, 2009 at 1:30 PM, Jeff Layton wrote: >> Try something like OpenMX over GigE. Much better latencies ?6?s, if that counts as much better. >> and should perform and scale better. Are there data on that? I'm not clear how much more efficient than TCP it might be CPU-wise, for instance, and I'm not sure how best to check. > How close does it get to native Myrinet performance? Or Infiniband. Not at all for Infiniband. With the right NICs on two rails, it's competitive with our Myrinet-2000 system. See open-mx.org for 10G data, but they're presumably not relevant to you. > OpenMX might be a great way for our cluster too to achieve better > performance without changing our eth backbone. In principle with Open MPI, it should use the two rails (NICs) to double the bandwidth as with TCP; that's currently broken, although Manchester seem to be getting away with it somehow. Brice will get back to fixing it when he returns in a couple of weeks. From d.love at liverpool.ac.uk Sun Jun 28 04:47:14 2009 From: d.love at liverpool.ac.uk (Dave Love) Date: Sun, 28 Jun 2009 12:47:14 +0100 Subject: [Beowulf] Re: typical latencies for gigabit ethernet References: Message-ID: <87vdmg1sxp.fsf@liv.ac.uk> Rahul Nabar writes: > What are typical latencies I ought to be seeing with gigabit eth? I have some data at http://www.nw-grid.ac.uk/LivMPI Note that you may have to tune the driver parameters to get the lowest latency. From gerry.creager at tamu.edu Sun Jun 28 09:15:23 2009 From: gerry.creager at tamu.edu (Gerry Creager) Date: Sun, 28 Jun 2009 11:15:23 -0500 Subject: [Beowulf] Re: typical latencies for gigabit ethernet In-Reply-To: <87vdmg1sxp.fsf@liv.ac.uk> References: <87vdmg1sxp.fsf@liv.ac.uk> Message-ID: <4A47971B.3000203@tamu.edu> Dave Love wrote: > Rahul Nabar writes: > >> What are typical latencies I ought to be seeing with gigabit eth? > > I have some data at http://www.nw-grid.ac.uk/LivMPI > Note that you may have to tune the driver parameters to get the lowest > latency. Dave, Can you say something about any tuning you did to get decent results? 
We've had some interesting problems (until lately; new broadcom drivers helped) getting decent performance. We're using Dell 1950's with Harpertown and broadcom NICs are native. Which driver parms had to get tweaked? Thanks, Gerry -- Gerry Creager -- gerry.creager at tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From ashley at pittman.co.uk Sun Jun 28 10:19:50 2009 From: ashley at pittman.co.uk (Ashley Pittman) Date: Sun, 28 Jun 2009 18:19:50 +0100 Subject: [Beowulf] dedupe filesystem In-Reply-To: References: <1243964380.30944.8.camel@localhost.localdomain> Message-ID: <1246209590.4271.11.camel@alpha> On Thu, 2009-06-25 at 13:09 -0500, Rahul Nabar wrote: > On Tue, Jun 2, 2009 at 12:39 PM, Ashley Pittman > wrote: > Fdupes scans the filesystem looking for files where the size > matches, if > it does it md5's them checking for matches and if that matches > it > finally does a byte-by-byte compare to be 100% sure. > > Why is a full byte-by-byte comparison needed even after a md5 sum > matches? I know there is a vulnerability in md5 but that's more of a > security thing and by random chance super unlikely , right? > Just curious.... Checksums are a (inherently imperfect) way of checking that two files aren't different, they are not intended to and cannot prove that two files are the same. If you relied on the md5 sum alone there would be collisions and those collisions would result in you losing data. Ashley Pittman, -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk From lindahl at pbm.com Sun Jun 28 17:21:26 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Sun, 28 Jun 2009 17:21:26 -0700 Subject: [Beowulf] Re: HPC fault tolerance using virtualization In-Reply-To: <87vdmg38v5.fsf@liv.ac.uk> References: <9f8092cc0906151059x6f38f1f3r28f78fde6b09085@mail.gmail.com> <200906161038.56577.kilian.cavalotti.work@gmail.com> <9f8092cc0906160202m274ad417pa9f3da905c17799d@mail.gmail.com> <87vdmg38v5.fsf@liv.ac.uk> Message-ID: <20090629002126.GA2033@bx9.net> On Sun, Jun 28, 2009 at 12:17:50PM +0100, Dave Love wrote: > > and for disks run a smartctl test and see if a disk is showing > > symtopms which might make it fail in future. > > What I typically see from smartd is alerts when one or more sectors has > already gone bad, although that tends not to be something that will > clobber the running job. How should it be configured to do better > (without noise)? That isn't noise, that's signal. You're just lucky that your running job doesn't need the data off the bad sector. You can try waiting until the job finishes before taking the node out of service; from the sounds of it, you will usually win. But if you don't have application-level end-to-end checksums of your data, how do you know if you won or not? In my big MapReduce cluster (800 data disks), about 2/3 of the time I'll see an I/O error in my application, or checksum failure, and 1/3 of the time I will see a smartd error and no application error. 
-- greg From d.love at liverpool.ac.uk Mon Jun 29 02:59:27 2009 From: d.love at liverpool.ac.uk (Dave Love) Date: Mon, 29 Jun 2009 10:59:27 +0100 Subject: [Beowulf] Re: typical latencies for gigabit ethernet In-Reply-To: <4A47971B.3000203@tamu.edu> (Gerry Creager's message of "Sun, 28 Jun 2009 17:15:23 +0100") References: <87vdmg1sxp.fsf@liv.ac.uk> <4A47971B.3000203@tamu.edu> Message-ID: <87r5x3e4xs.fsf@liv.ac.uk> Gerry Creager writes: > Can you say something about any tuning you did to get decent results? To get the lowest latency, turn off rx interrupt coalescence, either with ethtool or module parameters, depending on the driver. Of course, you may not want to turn it off completely, depending on how much load the extra interrupts cause. > We've had some interesting problems (until lately; new broadcom drivers > helped) getting decent performance. We're using Dell 1950's with > Harpertown and broadcom NICs are native. > > Which driver parms had to get tweaked? There are (at least) two different drivers for Broadcoms, and I've no idea what that hardware has. I've only used tg3, and am actually somewhat confused by it. (I haven't had time to try to understand the driver code and I'm not primarily interested in those NICs as the nVidias on our boxes seem better.) Are you using tg3 or bnx2? From d.love at liverpool.ac.uk Mon Jun 29 05:30:31 2009 From: d.love at liverpool.ac.uk (Dave Love) Date: Mon, 29 Jun 2009 13:30:31 +0100 Subject: [Beowulf] Re: dedupe filesystem In-Reply-To: <1246209590.4271.11.camel@alpha> (Ashley Pittman's message of "Sun, 28 Jun 2009 18:19:50 +0100") References: <1243964380.30944.8.camel@localhost.localdomain> <1246209590.4271.11.camel@alpha> Message-ID: <878wjbdxy0.fsf@liv.ac.uk> Ashley Pittman writes: > If you relied on the md5 sum alone there would be collisions and those > collisions would result in you losing data. The question is whether the probability of collisions is high compared with other causes -- presumably hardware, assuming no-one puts figures on the software reliability. As far as I remember, the calculation for SHA-1 for Plan 9's Venti?, which no-one seems to have mentioned, says ignore collisions for petabyte filesystems. Ob-Beowulf: You can run Venti on GNU/Linux,? but I don't know how the current implementation performs. Also, GlusterFS has a `data de-duplication translator' on its roadmap, which I didn't see mentioned. -- 1. http://plan9.bell-labs.com/sys/doc/venti/venti.html 2. http://swtch.com/plan9port/ From d.love at liverpool.ac.uk Mon Jun 29 05:33:37 2009 From: d.love at liverpool.ac.uk (Dave Love) Date: Mon, 29 Jun 2009 13:33:37 +0100 Subject: [Beowulf] Re: HPC fault tolerance using virtualization) In-Reply-To: <20090629002126.GA2033@bx9.net> (Greg Lindahl's message of "Sun, 28 Jun 2009 17:21:26 -0700") References: <9f8092cc0906151059x6f38f1f3r28f78fde6b09085@mail.gmail.com> <200906161038.56577.kilian.cavalotti.work@gmail.com> <9f8092cc0906160202m274ad417pa9f3da905c17799d@mail.gmail.com> <87vdmg38v5.fsf@liv.ac.uk> <20090629002126.GA2033@bx9.net> Message-ID: <877hyvdxsu.fsf_-_@liv.ac.uk> Greg Lindahl writes: >> What I typically see from smartd is alerts when one or more sectors has >> already gone bad, although that tends not to be something that will >> clobber the running job. How should it be configured to do better >> (without noise)? > > That isn't noise, that's signal. Of course I didn't mean that bad block alerts were noise. However, there is what I and a hardware expert think is noise from the default smartd configuration. 
I'm interested in how best to configure it for useful warnings. I did have a look OTW, of course. > You're just lucky that your running > job doesn't need the data off the bad sector. Not if the problem is, say, on /usr, which the job normally isn't going to need before it finishes. > You can try waiting > until the job finishes before taking the node out of service; from the > sounds of it, you will usually win. But if you don't have > application-level end-to-end checksums of your data, how do you know > if you won or not? I know where the job is doing i/o, and I'm not going to kill multi-day, multi-node jobs -- especially not automatically -- because there's a bad sector somewhere irrelevant. Also we have better things to worry about here, at least, than application checksums, much as they might feature in an ideal world. From gerry.creager at tamu.edu Mon Jun 29 06:55:56 2009 From: gerry.creager at tamu.edu (Gerry Creager) Date: Mon, 29 Jun 2009 08:55:56 -0500 Subject: [Beowulf] Re: typical latencies for gigabit ethernet In-Reply-To: <87r5x3e4xs.fsf@liv.ac.uk> References: <87vdmg1sxp.fsf@liv.ac.uk> <4A47971B.3000203@tamu.edu> <87r5x3e4xs.fsf@liv.ac.uk> Message-ID: <4A48C7EC.7090607@tamu.edu> Dave Love wrote: > Gerry Creager writes: > >> Can you say something about any tuning you did to get decent results? > > To get the lowest latency, turn off rx interrupt coalescence, either > with ethtool or module parameters, depending on the driver. Of course, > you may not want to turn it off completely, depending on how much load > the extra interrupts cause. I'll look at that and give it a try. Thanks. >> We've had some interesting problems (until lately; new broadcom drivers >> helped) getting decent performance. We're using Dell 1950's with >> Harpertown and broadcom NICs are native. >> >> Which driver parms had to get tweaked? > > There are (at least) two different drivers for Broadcoms, and I've no > idea what that hardware has. I've only used tg3, and am actually > somewhat confused by it. (I haven't had time to try to understand the > driver code and I'm not primarily interested in those NICs as the > nVidias on our boxes seem better.) Are you using tg3 or bnx2? I had rather nasty results with tg3 and abandoned it. We're using bnx2 now. The latest iteration seems (guardedly) better than the last one. gerry -- Gerry Creager -- gerry.creager at tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From gerry.creager at tamu.edu Mon Jun 29 06:57:19 2009 From: gerry.creager at tamu.edu (Gerry Creager) Date: Mon, 29 Jun 2009 08:57:19 -0500 Subject: [Beowulf] Re: dedupe filesystem In-Reply-To: <878wjbdxy0.fsf@liv.ac.uk> References: <1243964380.30944.8.camel@localhost.localdomain> <1246209590.4271.11.camel@alpha> <878wjbdxy0.fsf@liv.ac.uk> Message-ID: <4A48C83F.3020604@tamu.edu> Dave Love wrote: > Ashley Pittman writes: > >> If you relied on the md5 sum alone there would be collisions and those >> collisions would result in you losing data. > > The question is whether the probability of collisions is high compared > with other causes -- presumably hardware, assuming no-one puts figures > on the software reliability. As far as I remember, the calculation for > SHA-1 for Plan 9's Venti?, which no-one seems to have mentioned, says > ignore collisions for petabyte filesystems. > > Ob-Beowulf: You can run Venti on GNU/Linux,? 
but I don't know how the > current implementation performs. Also, GlusterFS has a `data > de-duplication translator' on its roadmap, which I didn't see mentioned. Our initial results with a GlusterFS implementation led us back to NFS. Who's got a really successful GlusterFS implementation working? > -- > 1. http://plan9.bell-labs.com/sys/doc/venti/venti.html > 2. http://swtch.com/plan9port/ > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Gerry Creager -- gerry.creager at tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From atchley at myri.com Mon Jun 29 07:26:05 2009 From: atchley at myri.com (Scott Atchley) Date: Mon, 29 Jun 2009 10:26:05 -0400 Subject: [Beowulf] Re: typical latencies for gigabit ethernet In-Reply-To: <87r5x3e4xs.fsf@liv.ac.uk> References: <87vdmg1sxp.fsf@liv.ac.uk> <4A47971B.3000203@tamu.edu> <87r5x3e4xs.fsf@liv.ac.uk> Message-ID: On Jun 29, 2009, at 5:59 AM, Dave Love wrote: >> Can you say something about any tuning you did to get decent results? > > To get the lowest latency, turn off rx interrupt coalescence, either > with ethtool or module parameters, depending on the driver. Of > course, > you may not want to turn it off completely, depending on how much load > the extra interrupts cause. When I test Open-MX, I turn interrupt coalescing off. I run omx_pingpong to determine the lowest latency (LL). If the NIC's driver allows one to specify the interrupt value, I set it to LL-1. If the driver does not allow specifying the actual rate (i.e. it only has predetermined values), then I leave it off. The downside is lower throughput for large messages on 10G Ethernet. I don't think it matters on gigabit. Brice and Nathalie have a paper which implements an adaptive interrupt coalescing so that you do not have to manually tune anything: "Finding a Tradeoff between Host Interrupt Load and MPI Latency over Ethernet" available at: http://hal.inria.fr/inria-00397328 I do not know if it has been included in a release yet. Scott From landman at scalableinformatics.com Mon Jun 29 08:15:08 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Mon, 29 Jun 2009 11:15:08 -0400 Subject: [Beowulf] Re: dedupe filesystem In-Reply-To: <4A48C83F.3020604@tamu.edu> References: <1243964380.30944.8.camel@localhost.localdomain> <1246209590.4271.11.camel@alpha> <878wjbdxy0.fsf@liv.ac.uk> <4A48C83F.3020604@tamu.edu> Message-ID: <4A48DA7C.7030709@scalableinformatics.com> Gerry Creager wrote: >> Ob-Beowulf: You can run Venti on GNU/Linux,? but I don't know how the >> current implementation performs. Also, GlusterFS has a `data >> de-duplication translator' on its roadmap, which I didn't see mentioned. > > Our initial results with a GlusterFS implementation led us back to NFS. > Who's got a really successful GlusterFS implementation working? > We are seeing generally very good results with GlusterFS. Performance is in part in how you arrange your translators, how your fabric is connected, etc. If you don't work hard with it, you can get ~200 MB/s per client over SDR IB. If you work a bit harder you can get ~500 MB/s per client. You do need enough server bandwidth to be able to source the data. 
I think I had sent a note in the past offering some assistance with this. My bad if the note didn't get through ... or if my responses were lost in the shuffle. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From d.love at liverpool.ac.uk Mon Jun 29 08:40:47 2009 From: d.love at liverpool.ac.uk (Dave Love) Date: Mon, 29 Jun 2009 16:40:47 +0100 Subject: [Beowulf] Re: typical latencies for gigabit ethernet In-Reply-To: <4A48C7EC.7090607@tamu.edu> (Gerry Creager's message of "Mon, 29 Jun 2009 14:55:56 +0100") References: <87vdmg1sxp.fsf@liv.ac.uk> <4A47971B.3000203@tamu.edu> <87r5x3e4xs.fsf@liv.ac.uk> <4A48C7EC.7090607@tamu.edu> Message-ID: <87my7rcakg.fsf@liv.ac.uk> Gerry Creager writes: > I had rather nasty results with tg3 and abandoned it. We're using bnx2 > now. The latest iteration seems (guardedly) better than the last one. I thought that they were for different hardware (NetXtreme I c.f. NetXtreme II, according to broadcom.com). Is that not the case? If bnx2 behaves the same as tg3 here, set rx-frames to 1 with ethtool. From d.love at liverpool.ac.uk Mon Jun 29 09:10:03 2009 From: d.love at liverpool.ac.uk (Dave Love) Date: Mon, 29 Jun 2009 17:10:03 +0100 Subject: [Beowulf] Re: typical latencies for gigabit ethernet In-Reply-To: (Scott Atchley's message of "Mon, 29 Jun 2009 15:26:05 +0100") References: <87vdmg1sxp.fsf@liv.ac.uk> <4A47971B.3000203@tamu.edu> <87r5x3e4xs.fsf@liv.ac.uk> Message-ID: <87iqifc97o.fsf@liv.ac.uk> Scott Atchley writes: > When I test Open-MX, I turn interrupt coalescing off. I run > omx_pingpong to determine the lowest latency (LL). If the NIC's driver > allows one to specify the interrupt value, I set it to LL-1. Right, and that's what I did before, with sensible results I thought. Repeating it now on Centos 5.2 and OpenSuSE 10.3, it doesn't behave sensibly, and I don't know what's different from the previous SuSE results apart, probably, from the minor kernel version. If I set rx-frames=0, I see this: rx-usec latency (?s) 20 34.6 12 26.3 6 20.0 1 14.8 whereas if I just set rx-frames=1, I get 14.7 ?s, roughly independently of rx-usec. (Those figures are probably ??0.2?s.) > If the > driver does not allow specifying the actual rate (i.e. it only has > predetermined values), then I leave it off. Right. (Adaptive coalescence gave significantly higher latency with our nVidia and Intel NICs.) For others interested, this affects TCP results similarly to open-mx, though the base TCP latency is substantially worse, of course. I was going to write this up for the OMX FAQ, but was loath to without understanding the tg3 situation. > The downside is lower throughput for large messages on 10G Ethernet. I > don't think it matters on gigabit. It doesn't affect the ping-pong throughput significantly, but I don't know if it has any effect on the system overall (other cores servicing the interrupts) on `typical' jobs. > Brice and Nathalie have a paper which implements an adaptive interrupt > coalescing so that you do not have to manually tune anything: Isn't that only relevant if you control the firmware? I previously didn't really care about free firmware for devices in the same way as free software generally, but am beginning to see reasons to care. 
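One way to read the table in Dave's message above (my interpretation, not something stated at this point in the thread): the measured ping-pong latencies look like the rx-frames=1 figure plus the full rx-usec value,

\[ t_{\text{pingpong}} \;\approx\; t_0 + t_{\text{rx-usec}} : \qquad 14.8+20 = 34.8, \quad 14.8+12 = 26.8, \quad 14.8+6 = 20.8 \;\mu\mathrm{s}, \]

against the measured 34.6, 26.3 and 20.0 us. That is what one would expect if the driver holds the interrupt for the full rx-usec interval after the first received frame, which matches Patrick Geoffray's explanation later in this thread that some drivers implement rx-usecs as the delay between the first packet and the interrupt rather than as a minimum gap between interrupts.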
From richard.walsh at comcast.net Mon Jun 29 10:43:02 2009 From: richard.walsh at comcast.net (richard.walsh at comcast.net) Date: Mon, 29 Jun 2009 17:43:02 +0000 (UTC) Subject: [Beowulf] Xeon Nehalem 5500 series (socket 1366) DP motherboard recommendations/experiences ... Message-ID: <1654630172.434901246297382712.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> All, I am putting together a bill of materials for a small cluster based on the Xeon Nehalem 5500 series. What dual-socket motherboards (ATX and ATX-extended) are people happy with? Which ones should I avoid? Thanks much, Richard Walsh Thrashing River Computing -------------- next part -------------- An HTML attachment was scrubbed... URL: From atchley at myri.com Mon Jun 29 10:44:47 2009 From: atchley at myri.com (Scott Atchley) Date: Mon, 29 Jun 2009 13:44:47 -0400 Subject: [Beowulf] Re: typical latencies for gigabit ethernet In-Reply-To: <87iqifc97o.fsf@liv.ac.uk> References: <87vdmg1sxp.fsf@liv.ac.uk> <4A47971B.3000203@tamu.edu> <87r5x3e4xs.fsf@liv.ac.uk> <87iqifc97o.fsf@liv.ac.uk> Message-ID: <4DF2DCCD-FF21-460F-89BF-238582FFEBE0@myri.com> On Jun 29, 2009, at 12:10 PM, Dave Love wrote: >> When I test Open-MX, I turn interrupt coalescing off. I run >> omx_pingpong to determine the lowest latency (LL). If the NIC's >> driver >> allows one to specify the interrupt value, I set it to LL-1. > > Right, and that's what I did before, with sensible results I thought. > Repeating it now on Centos 5.2 and OpenSuSE 10.3, it doesn't behave > sensibly, and I don't know what's different from the previous SuSE > results apart, probably, from the minor kernel version. If I set > rx-frames=0, I see this: > > rx-usec latency (?s) > 20 34.6 > 12 26.3 > 6 20.0 > 1 14.8 > > whereas if I just set rx-frames=1, I get 14.7 ?s, roughly > independently > of rx-usec. (Those figures are probably ??0.2?s.) That is odd. I have only tested with Intel e1000 and our myri10ge Ethernet driver. The Intel driver does not let you specify value other than certain settings (0, 25, etc.). The myri10ge driver does allow you to specify any value. Your results may be specific to that driver. >> Brice and Nathalie have a paper which implements an adaptive >> interrupt >> coalescing so that you do not have to manually tune anything: > > Isn't that only relevant if you control the firmware? I previously > didn't really care about free firmware for devices in the same way as > free software generally, but am beginning to see reasons to care. True, I believe that had to make two very small modifications to the myri10ge firmware. I have head that some Ethernet drivers do or will support adaptive coalescing which may give better performance than manually tuning and without modifying the NIC firmware for OMX. Scott From atchley at myri.com Mon Jun 29 11:04:50 2009 From: atchley at myri.com (Scott Atchley) Date: Mon, 29 Jun 2009 14:04:50 -0400 Subject: [Beowulf] Re: typical latencies for gigabit ethernet In-Reply-To: <4DF2DCCD-FF21-460F-89BF-238582FFEBE0@myri.com> References: <87vdmg1sxp.fsf@liv.ac.uk> <4A47971B.3000203@tamu.edu> <87r5x3e4xs.fsf@liv.ac.uk> <87iqifc97o.fsf@liv.ac.uk> <4DF2DCCD-FF21-460F-89BF-238582FFEBE0@myri.com> Message-ID: <37FBF31A-82C4-4BE6-9F5E-C74E5C15BDFD@myri.com> On Jun 29, 2009, at 1:44 PM, Scott Atchley wrote: >> Right, and that's what I did before, with sensible results I thought. 
>> Repeating it now on Centos 5.2 and OpenSuSE 10.3, it doesn't behave >> sensibly, and I don't know what's different from the previous SuSE >> results apart, probably, from the minor kernel version. If I set >> rx-frames=0, I see this: >> >> rx-usec latency (?s) >> 20 34.6 >> 12 26.3 >> 6 20.0 >> 1 14.8 >> >> whereas if I just set rx-frames=1, I get 14.7 ?s, roughly >> independently >> of rx-usec. (Those figures are probably ??0.2?s.) > > That is odd. I have only tested with Intel e1000 and our myri10ge > Ethernet driver. The Intel driver does not let you specify value > other than certain settings (0, 25, etc.). The myri10ge driver does > allow you to specify any value. > > Your results may be specific to that driver. As Patrick kindly pointed out, you are using rx-frames and not rx- usec. They are not equivalent. Scott From rpnabar at gmail.com Mon Jun 29 12:10:15 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Mon, 29 Jun 2009 14:10:15 -0500 Subject: [Beowulf] typical latencies for gigabit ethernet In-Reply-To: References: Message-ID: On Sat, Jun 27, 2009 at 5:21 PM, Mark Hahn wrote: > seems to be fairly variable. ?let's say 50 +- 20 microseconds. > >> Setups is a simple: server<-->switch<-->server. > > it may be instructive to try a server-server test case. Hmm...well I must be doing something terribly wrong then. Our latencies are in the 140 microseconds range (as revealed by ping) What's a typical debug protocol? What should I be checking for? -- Rahul From hahn at mcmaster.ca Mon Jun 29 13:15:53 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon, 29 Jun 2009 16:15:53 -0400 (EDT) Subject: [Beowulf] typical latencies for gigabit ethernet In-Reply-To: References: Message-ID: > Hmm...well I must be doing something terribly wrong then. Our > latencies are in the 140 microseconds range (as revealed by ping) well, ping is not _really_ a benchmark ;) but it does sound like you have interrupt coalescing enabled. on our dl145g2 nodes (BCM95721), I can peel ~40 us off ping times by "ethtool -C eth1 rx-usecs 1". ping is also reporting round-trip, (MPI benchmarks quote half-rtt). also, ping is using 64B packets, not min-sized ones. > What's a typical debug protocol? What should I be checking for? well, look at "ethtool -c" output and try with "ping -s 20". From rpnabar at gmail.com Mon Jun 29 14:24:48 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Mon, 29 Jun 2009 16:24:48 -0500 Subject: [Beowulf] typical latencies for gigabit ethernet In-Reply-To: References: Message-ID: On Mon, Jun 29, 2009 at 3:15 PM, Mark Hahn wrote: Thanks for all the help Mark! > well, ping is not _really_ a benchmark ;) I thought so! :) Lazy person's first shot. Now I will try ethtool. > but it does sound like you have interrupt coalescing enabled. Any way to verify if I do? > ?ping is also reporting round-trip, Ah! So my "real" latencies are 140/2 = 70 microsecs. That doesn't sound so terribly bad now. > well, look at "ethtool -c" output and try with "ping -s 20". Tried this. No improvement. Still round-trip-pings are in the 140 microsec range. 
-- Rahul From d.love at liverpool.ac.uk Mon Jun 29 14:55:25 2009 From: d.love at liverpool.ac.uk (Dave Love) Date: Mon, 29 Jun 2009 22:55:25 +0100 Subject: [Beowulf] Re: typical latencies for gigabit ethernet In-Reply-To: <4DF2DCCD-FF21-460F-89BF-238582FFEBE0@myri.com> (Scott Atchley's message of "Mon, 29 Jun 2009 18:44:47 +0100") References: <87vdmg1sxp.fsf@liv.ac.uk> <4A47971B.3000203@tamu.edu> <87r5x3e4xs.fsf@liv.ac.uk> <87iqifc97o.fsf@liv.ac.uk> <4DF2DCCD-FF21-460F-89BF-238582FFEBE0@myri.com> Message-ID: <87hbxy1z8y.fsf@liv.ac.uk> Scott Atchley writes: > That is odd. I have only tested with Intel e1000 and our myri10ge > Ethernet driver. The Intel driver does not let you specify value other > than certain settings (0, 25, etc.). I can't remember if I tried it, but it's documented to be adjustable in the range 100-100000 interrupts/s. > Your results may be specific to that driver. Sure. The only ones I have access to which are tunable with ethtool are the two Broadcoms for which I posted the data. > I have head that some Ethernet drivers do or will support adaptive > coalescing which may give better performance than manually tuning and > without modifying the NIC firmware for OMX. As I said, not in the case of the ones I tried (even in the low-latency adaptive mode which the e1000 doc says to use for HPC). From d.love at liverpool.ac.uk Mon Jun 29 14:57:32 2009 From: d.love at liverpool.ac.uk (Dave Love) Date: Mon, 29 Jun 2009 22:57:32 +0100 Subject: [Beowulf] Re: typical latencies for gigabit ethernet In-Reply-To: <37FBF31A-82C4-4BE6-9F5E-C74E5C15BDFD@myri.com> (Scott Atchley's message of "Mon, 29 Jun 2009 19:04:50 +0100") References: <87vdmg1sxp.fsf@liv.ac.uk> <4A47971B.3000203@tamu.edu> <87r5x3e4xs.fsf@liv.ac.uk> <87iqifc97o.fsf@liv.ac.uk> <4DF2DCCD-FF21-460F-89BF-238582FFEBE0@myri.com> <37FBF31A-82C4-4BE6-9F5E-C74E5C15BDFD@myri.com> Message-ID: <87fxdi1z5f.fsf@liv.ac.uk> Scott Atchley writes: > As Patrick kindly pointed out, you are using rx-frames and not rx- > usec. They are not equivalent. That's something I haven't seen. However, I'm only using rx-frames=1 because simply adjusting rx-usec doesn't behave as expected. (It's documented, but perhaps only in the ethtool source, that rx-frames=0 means to use just rx-usec, which is what you want to be able to adjust the coalescence as suggested.) From patrick at myri.com Mon Jun 29 15:20:40 2009 From: patrick at myri.com (Patrick Geoffray) Date: Mon, 29 Jun 2009 18:20:40 -0400 Subject: [Beowulf] Re: typical latencies for gigabit ethernet In-Reply-To: <87fxdi1z5f.fsf@liv.ac.uk> References: <87vdmg1sxp.fsf@liv.ac.uk> <4A47971B.3000203@tamu.edu> <87r5x3e4xs.fsf@liv.ac.uk> <87iqifc97o.fsf@liv.ac.uk> <4DF2DCCD-FF21-460F-89BF-238582FFEBE0@myri.com> <37FBF31A-82C4-4BE6-9F5E-C74E5C15BDFD@myri.com> <87fxdi1z5f.fsf@liv.ac.uk> Message-ID: <4A493E38.3070302@myri.com> Dave Love wrote: > That's something I haven't seen. However, I'm only using rx-frames=1 > because simply adjusting rx-usec doesn't behave as expected. Instead of rx-usecs being the time between interrupts, it is sometimes implemented as the delay between the the first packet and the following interrupt, which is obviously wrong. rx-frames may generate an interrupt storm if you receive a stream of small packets. You may want to measure how many interrupts per second are produced in this case, compared to using rx-usecs. 
Patrick From patrick at myri.com Mon Jun 29 15:22:23 2009 From: patrick at myri.com (Patrick Geoffray) Date: Mon, 29 Jun 2009 18:22:23 -0400 Subject: [Beowulf] Re: typical latencies for gigabit ethernet In-Reply-To: <87iqifc97o.fsf@liv.ac.uk> References: <87vdmg1sxp.fsf@liv.ac.uk> <4A47971B.3000203@tamu.edu> <87r5x3e4xs.fsf@liv.ac.uk> <87iqifc97o.fsf@liv.ac.uk> Message-ID: <4A493E9F.2030505@myri.com> Dave, Scott, Dave Love wrote: > Scott Atchley writes: > >> When I test Open-MX, I turn interrupt coalescing off. I run >> omx_pingpong to determine the lowest latency (LL). If the NIC's driver >> allows one to specify the interrupt value, I set it to LL-1. Note that it is only meaningful wrt ping-pong latency. To optimize for all latency cases, you just want interrupt coalescing to be off. > results apart, probably, from the minor kernel version. If I set > rx-frames=0, I see this: > > rx-usec latency (?s) > 20 34.6 > 12 26.3 > 6 20.0 > 1 14.8 > > whereas if I just set rx-frames=1, I get 14.7 ?s, roughly independently > of rx-usec. (Those figures are probably ??0.2?s.) rx-usecs specifies the minimum time between interrupts, whereas rx-frames specifies the number of frames (packets) between interrupts. So, if you set rx-frames to 1, there will be an interrupt after each packet. Not many devices implement rx-frames, since it does not distinguish between small and large frames. Adaptive coalescing methods do look at the size of the frames to figure out if the traffic is mostly latency or bandwidth sensitive, but it's just a guess. >> The downside is lower throughput for large messages on 10G Ethernet. I >> don't think it matters on gigabit. > > It doesn't affect the ping-pong throughput significantly, but I don't > know if it has any effect on the system overall (other cores servicing > the interrupts) on `typical' jobs. On GigE, each 1500 Bytes frames takes more than 10us on the wire so even with interrupt coalescing turned off, you won't get more than 100K interrupts per second. It used to be a problem, but it's no big deal on recent machines. However, you can get a lot more interrupt when receiving smaller packets, although the interrupt overhead itself would limit the interrupt load to well below 1 Million per second. In the worst case, you would lose a core if you don't let the OS move the interrupt handler to do load balancing. What is one core these days ? :-) Patrick From d.love at liverpool.ac.uk Mon Jun 29 15:43:53 2009 From: d.love at liverpool.ac.uk (Dave Love) Date: Mon, 29 Jun 2009 23:43:53 +0100 Subject: [Beowulf] Re: typical latencies for gigabit ethernet In-Reply-To: (Rahul Nabar's message of "Mon, 29 Jun 2009 16:24:48 -0500") References: Message-ID: <87bpo61x06.fsf@liv.ac.uk> Rahul Nabar writes: > I thought so! :) Lazy person's first shot. Now I will try ethtool. It's not relevant with all NICs. Some use driver module parameters. > Any way to verify if I do? Consult the NIC's documentation. > Ah! So my "real" latencies are 140/2 = 70 microsecs. I see ping times of ~70?s between the nVidias I posted data on, and they have an MPI latency of ~12?s. If you want a measurement/benchmark for MPI performance use IMB, like for the results I posted. Otherwise people sometimes use netpipe. If you want basic Ethernet (Layer 2) figures, look for Hughes-Jones' ethmon, somewhere under http://www.hep.man.ac.uk/. 
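For a quick number without installing IMB or netpipe, a bare-bones MPI ping-pong in the same spirit is sketched below in C (1-byte messages, half round-trip reported, a fixed warm-up phase). It is only a sketch under those assumptions, not a substitute for the real benchmarks.

/* pingpong.c -- minimal half-round-trip latency test between two ranks.
 * Build: mpicc -O2 pingpong.c -o pingpong
 * Run  : mpirun -np 2 -host nodeA,nodeB ./pingpong
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, i, iters = 10000, warmup = 1000;
    char msg = 0;
    double t0 = 0.0;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < warmup + iters; i++) {
        if (i == warmup) {                /* start the clock after warm-up */
            MPI_Barrier(MPI_COMM_WORLD);
            t0 = MPI_Wtime();
        }
        if (rank == 0) {
            MPI_Send(&msg, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&msg, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {
            MPI_Recv(&msg, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(&msg, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    if (rank == 0) {
        double t = MPI_Wtime() - t0;
        /* one iteration = one round trip; report half of it, in microseconds */
        printf("half round-trip latency: %.2f us\n", t / iters / 2.0 * 1e6);
    }
    MPI_Finalize();
    return 0;
}

On the gigabit hardware discussed in this thread, values roughly in the 10-60 us range would be consistent with the figures Mark Hahn and Dave Love quote; numbers far above that usually point back at interrupt coalescing.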
From tom.elken at qlogic.com Mon Jun 29 16:44:36 2009 From: tom.elken at qlogic.com (Tom Elken) Date: Mon, 29 Jun 2009 16:44:36 -0700 Subject: [Beowulf] Re: typical latencies for gigabit ethernet In-Reply-To: <87bpo61x06.fsf@liv.ac.uk> References: <87bpo61x06.fsf@liv.ac.uk> Message-ID: <35AAF1E4A771E142979F27B51793A48882C583FDE0@AVEXMB1.qlogic.org> > > Ah! So my "real" latencies are 140/2 = 70 microsecs. > > I see ping times of ~70?s between the nVidias I posted data on, and > they > have an MPI latency of ~12?s. If you want a measurement/benchmark for > MPI performance use IMB, like for the results I posted. In addition to Intel MPI Benchmarks (formerly known as Pallas), osu_latency is another common benchmark for MPI latency, part of the OSU MPI Benchmark (OMB) suite, available here: http://mvapich.cse.ohio-state.edu/benchmarks/ -Tom > Otherwise > people sometimes use netpipe. If you want basic Ethernet (Layer 2) > figures, look for Hughes-Jones' ethmon, somewhere under > http://www.hep.man.ac.uk/. > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From hearnsj at googlemail.com Tue Jun 30 00:01:55 2009 From: hearnsj at googlemail.com (John Hearns) Date: Tue, 30 Jun 2009 08:01:55 +0100 Subject: [Beowulf] typical latencies for gigabit ethernet In-Reply-To: References: Message-ID: <9f8092cc0906300001h17880cc8x688367f2c8df5574@mail.gmail.com> 2009/6/29 Rahul Nabar : > On Mon, Jun 29, 2009 at 3:15 PM, Mark Hahn wrote: > > Thanks for all the help Mark! > > >> well, ping is not _really_ a benchmark ;) > > I thought so! :) Lazy person's first shot. Now I will try ethtool. > Rahul, its maybe worth making something explicitly clear. When folks on this list talk about 'latency' they are talking about times for sending MPI messages. The Pallas and OMB benchmarks suggested do that. When you use 'ping' you are measuring the trip time for IP ICMP packages, which, yep, is a valid measurement and gives a good idea of performance. From d.love at liverpool.ac.uk Tue Jun 30 06:07:59 2009 From: d.love at liverpool.ac.uk (Dave Love) Date: Tue, 30 Jun 2009 14:07:59 +0100 Subject: [Beowulf] Re: typical latencies for gigabit ethernet References: <87vdmg1sxp.fsf@liv.ac.uk> <4A47971B.3000203@tamu.edu> <87r5x3e4xs.fsf@liv.ac.uk> <87iqifc97o.fsf@liv.ac.uk> <4DF2DCCD-FF21-460F-89BF-238582FFEBE0@myri.com> <37FBF31A-82C4-4BE6-9F5E-C74E5C15BDFD@myri.com> <87fxdi1z5f.fsf@liv.ac.uk> <4A493E38.3070302@myri.com> Message-ID: <878wj9c1jk.fsf@liv.ac.uk> Patrick Geoffray writes: > Instead of rx-usecs being the time between interrupts, it is sometimes > implemented as the delay between the the first packet and the following > interrupt, which is obviously wrong. Ah. Is that likely to be in the driver, where it might be fixed, or the NIC firmware? > rx-frames may generate an interrupt storm if you receive a stream of > small packets. You may want to measure how many interrupts per second > are produced in this case, compared to using rx-usecs. Yes, but I'm not sure how useful that is since the rx-usecs adjustment doesn't seem sensible in this case. Also I'm not actually interested in tg3 for production use, so maybe someone who is would like to try. 
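Picking up Patrick's suggestion, quoted just above, to measure how many interrupts per second a given coalescing setting actually produces: diffing /proc/interrupts by hand already does the job, but here is a small self-contained C sampler that prints a rate for the lines matching a NIC name. The interface name "eth0", the program name and the usual /proc/interrupts layout (an "NN:" label followed by one counter per CPU) are assumptions on my part; treat it as a sketch.

/* irqrate.c -- rough interrupts/second for /proc/interrupts lines
 * whose description matches a given string (e.g. "eth0").
 * Usage: ./irqrate eth0 [seconds]
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Sum the per-CPU counters on every line that mentions 'name'. */
static unsigned long long count_irqs(const char *name)
{
    FILE *f = fopen("/proc/interrupts", "r");
    char line[4096];
    unsigned long long total = 0;

    if (!f) { perror("/proc/interrupts"); exit(1); }
    while (fgets(line, sizeof line, f)) {
        char *p;
        if (!strstr(line, name))
            continue;
        p = strchr(line, ':');            /* skip the "NN:" IRQ label      */
        if (!p)
            continue;
        p++;
        for (;;) {                         /* numeric columns = per-CPU     */
            char *end;
            unsigned long long v = strtoull(p, &end, 10);
            if (end == p)
                break;                     /* reached the description field */
            total += v;
            p = end;
        }
    }
    fclose(f);
    return total;
}

int main(int argc, char **argv)
{
    const char *name = (argc > 1) ? argv[1] : "eth0";
    unsigned int secs = (argc > 2) ? (unsigned int)atoi(argv[2]) : 5;
    unsigned long long before, after;

    if (secs == 0)
        secs = 1;
    before = count_irqs(name);
    sleep(secs);
    after = count_irqs(name);
    printf("%s: %.0f interrupts/s over %u s\n",
           name, (double)(after - before) / secs, secs);
    return 0;
}

Comparing the rate seen with rx-frames=1 against an rx-usecs-based setting while a small-packet stream is running should show directly whether the interrupt-storm concern applies for a given driver.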
From gerry.creager at tamu.edu Tue Jun 30 06:10:59 2009 From: gerry.creager at tamu.edu (Gerry Creager) Date: Tue, 30 Jun 2009 08:10:59 -0500 Subject: [Beowulf] Re: typical latencies for gigabit ethernet In-Reply-To: <87my7rcakg.fsf@liv.ac.uk> References: <87vdmg1sxp.fsf@liv.ac.uk> <4A47971B.3000203@tamu.edu> <87r5x3e4xs.fsf@liv.ac.uk> <4A48C7EC.7090607@tamu.edu> <87my7rcakg.fsf@liv.ac.uk> Message-ID: <4A4A0EE3.6060900@tamu.edu> Dave Love wrote: > Gerry Creager writes: > >> I had rather nasty results with tg3 and abandoned it. We're using bnx2 >> now. The latest iteration seems (guardedly) better than the last one. > > I thought that they were for different hardware (NetXtreme I > c.f. NetXtreme II, according to broadcom.com). Is that not the case? > > If bnx2 behaves the same as tg3 here, set rx-frames to 1 with ethtool. Anaconda (CentOS2 originally installed a tg3 driver. Performance sucked (technical term) with that and we went to a bnx2 driver consistent with our understanding of the hardware. It sucked less. Not until a recent combined BIOS update on the Dell 1950s and a new bnx2 driver, have I begun to see decent operation at 9000 byte frames. gerry -- Gerry Creager -- gerry.creager at tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From d.love at liverpool.ac.uk Tue Jun 30 06:17:35 2009 From: d.love at liverpool.ac.uk (Dave Love) Date: Tue, 30 Jun 2009 14:17:35 +0100 Subject: [Beowulf] Re: typical latencies for gigabit ethernet References: <87vdmg1sxp.fsf@liv.ac.uk> <4A47971B.3000203@tamu.edu> <87r5x3e4xs.fsf@liv.ac.uk> <87iqifc97o.fsf@liv.ac.uk> <4A493E9F.2030505@myri.com> Message-ID: <877hytc13k.fsf@liv.ac.uk> Patrick Geoffray writes: > So, if you set rx-frames to 1, there will be an interrupt after each > packet. Isn't that turning off coalescence, as you recommended? > Not many devices implement rx-frames, since it does not > distinguish between small and large frames. Adaptive coalescing methods > do look at the size of the frames to figure out if the traffic is mostly > latency or bandwidth sensitive, but it's just a guess. Yes. With e1000, I saw 28.2?s omx_perf latency using InterruptThrottleRate=1, v. 19.5 using InterruptThrottleRate=0. With forcedeth, optimization_mode=1, it was 20.2, v. 10.4 with optimization_mode=0 (the default). These weren't with the same setup as the figures on the web page I referred to, by the way. > On GigE, each 1500 Bytes frames takes more than 10us on the wire so even > with interrupt coalescing turned off, you won't get more than 100K > interrupts per second. Good point. > In the worst case, you would lose a core if you don't let the OS move > the interrupt handler to do load balancing. What is one core these > days ? :-) I guess that depends on what everything else is doing. It's normally recommended to use the default (non-)affinity of interrupts, isn't it? I'll try to collect anything useful from this for the Open-MX FAQ. This stuff seems generally badly documented (such as ethtool not even telling you what the coalescence parameters actually are). Thanks (and thanks to Myricom for funding Open-MX, by the way). 
From Bogdan.Costescu at iwr.uni-heidelberg.de Tue Jun 30 06:54:45 2009 From: Bogdan.Costescu at iwr.uni-heidelberg.de (Bogdan Costescu) Date: Tue, 30 Jun 2009 15:54:45 +0200 (CEST) Subject: [Beowulf] Parallel Programming Question In-Reply-To: <4A427CAA.9050304@ldeo.columbia.edu> References: <428810f20906232123x4ba721aye1f4c64edec741b0@mail.gmail.com> <4A427CAA.9050304@ldeo.columbia.edu> Message-ID: On Wed, 24 Jun 2009, Gus Correa wrote: > the "master" processor reads... broadcasts parameters that are used > by all "slave" processors, and scatters any data that will be > processed in a distributed fashion by each "slave" processor. > ... > That always works, there is no file system contention. I beg to disagree. There is no file system contention if this job is the only one doing the I/O at that time, which could be the case if a job takes the whole cluster. However, in a more conventional setup with several jobs running at the same time, there is I/O done from several nodes (running the MPI rank 0 of each job) at the same time, which will still look like mostly random I/O to the storage. > Another drawback is that you need to write more code for the I/O > procedure. I also disagree here. The code doing I/O would need to only happen on MPI rank 0, so no need to think for the other ranks about race conditions, computing a rank-based position in the file, etc. > In addition, MPI is in control of everything, you are less dependent > on NFS quirks. ... or cluster design. I have seen several clusters which were designed with 2 networks, a HPC one (Myrinet or Infiniband) and GigE, where the HPC network had full bisection bandwidth, but the GigE was a heavily over-subscribed one as the design really thought only about MPI performance and not about I/O performance. In such an environment, it's rather useless to try to do I/O simultaneously from several nodes which share the same uplink, independent whether the storage is a single NFS server or a parallel FS. Doing I/O from only one node would allow full utilization of the bandwidth on the chain of uplinks to the file-server and the data could then be scattered/gathered fast through the HPC network. Sure, a more hardware-aware application could have been more efficient (f.e. if it would be possible to describe the network over-subscription so that as many uplinks could be used simultaneously as possible), but a more balanced cluster design would have been even better... > [ parallel I/O programs ] always cause a problem when the number > of processors is big. I'd also like to disagree here. Parallel file systems teach us that a scalable system is one where the operations are split between several units that do the work. Applying the same knowledge to the generation of the data, a scalable application is one for which the I/O operations are done as much as possible split between the ranks. IMHO, the "problem" that you see is actually caused by reaching the limits of your cluster, IOW this is a local problem of that particular cluster and not a problem in the application. By re-writing the application to make it more NFS-friendly (f.e. like the above "rank 0 does all I/O"), you will most likely kill scalability for another HPC setup with a distributed/parallel storage setup. > Often times these codes were developed on big iron machines, > ignoring the hurdles one has to face on a Beowulf. Well, the definition of Beowulf is quite fluid. 
Nowadays is sufficiently easy to get a parallel FS running with commodity hardware that I wouldn't associate it anymore with big iron. > In general they don't use MPI parallel I/O either Being on the teaching side in a recent course+practical work involving parallel I/O, I've seen computer science and physics students making quite easily the transition from POSIX I/O done on a shared file system to MPI-I/O. They get sometimes an index wrong, but mostly the conversion is painless. After that, my impression has become that it's mostly lazyness and the attitude 'POSIX is everywhere anywhere, why should I bother with something that might be missing' that keeps applications at this stage. -- Bogdan Costescu IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany Phone: +49 6221 54 8240, Fax: +49 6221 54 8850 E-mail: bogdan.costescu at iwr.uni-heidelberg.de From balaji at mcs.anl.gov Tue Jun 30 09:39:10 2009 From: balaji at mcs.anl.gov (Pavan Balaji) Date: Tue, 30 Jun 2009 11:39:10 -0500 Subject: [Beowulf] [hpc-announce] Submission deadline for HPIDC '09 workshop extended to July 10th, 2009 Message-ID: <4A4A3FAE.3080603@mcs.anl.gov> Dear Colleague: This is to inform you that the International Workshop on High Performance Interconnects for Distributed Computing (HPIDC) has extended the paper submission deadline to July 10th, 2009. A full CFP can be found below. Regards, -- Ada & Pavan ---------------------------------------------------------------------- Workshop on High Performance Interconnects for Distributed Computing (HPI-DC'09) in conjunction with Cluster 2009 August 31, 2009, New Orleans, Louisiana http://www.cercs.gatech.edu/hpidc2009 ********************************************************************* Call for Papers ********************************************************************* The emergence of 10.0 GigE and above, InfiniBand and other high-performance interconnection technologies, programmable NICs and networking platforms, and protocols like DDP and RDMA over IP, make it possible to create tightly linked systems across physical distances that exceed those of traditional single cluster or server systems. These technologies can deliver communication capabilities that achieve the performance levels needed by high end applications in enterprise systems and like those produced by the high performance computing community. Furthermore, the manycore nature of next generation platforms and the creation of distributed cloud computing infrastructure will greatly increase the demand for high performance communication capabilities over wide area distances. The purpose of this workshop is to explore the confluence of distributed computing and communications technologies with high performance interconnects, as applicable or applied to realistic high end applications. The intent is to create a venue that will act as a bridge between researchers developing tools and platforms for high-performance distributed computing, end user applications seeking high performance solutions, and technology providers aiming to improve interconnect and networking technologies for future systems. The hope is to foster knowledge creation and intellectual interchanges between HPC and Cloud computing end users and technology developers, in the specific domain of high performance distributed interconnects. 
Topics of interest include but are not limited to: # Hardware/software architectures for communication infrastructures for HPC and Cloud Computing # Data and control protocols for interactive and large data volume applications # Novel devices and technologies to enhance interconnect properties # Interconnect-level issues when extending high performance beyond single machines, including architecture, protocols, services, QoS, and security # Remote storage (like iSCSI), remote databases, and datacenters, etc. # Development tools, programming environments and models (like PGAS, OpenShmem, Hadoop, etc.), ranging from programming language support to simulation environments. PAPER SUBMISSIONS: HPI-DC invites authors to submit original and unpublished work. Please submit extended abstracts or full papers, not exceeding 8 double-column pages in 10 point font or larger, in IEEE format. Electronic submission is strongly encouraged. Hard copies will be accepted only if electronic submission is not possible. Submission implies the willingness of at least one of the authors to register and present the paper. Any questions concerning hardcopy submissions or any other issues may be directed to the Program Co-Chairs. IMPORTANT DATES: # Paper submission: July 10th, 2009 # Notification of acceptance: July 22nd, 2009 # Final manuscript due: July 29th, 2009 # Workshop date: Aug. 31st, 2009 ORGANIZATION: General Chair # Steve Poole, Oak Ridge National Lab Program Co-Chairs # Pavan Balaji, Argonne National Lab # Ada Gavrilovska, Georgia Institute of Technology Technical Program Committee: # Ahmad Afsahi, Queen's University, Canada # Taisuke Boku, University of Tsukuba, Japan # Ron Brightwell, Sandia National Laboratory # Patrick Geoffray, Myricom # Kei Hiraki, University of Tokyo, Japan # Hyun-wook Jin, Konkuk University, Korea # Pankaj Mehra, HP Research # Guillaume Mercier, INRIA, France # Scott Pakin, Los Alamos National Laboratory # D. K. Panda, Ohio State University # Fabrizio Petrini, IBM Research # Karsten Schwan, Georgia Tech # Jesper Traeff, NEC, Europe # Sudhakar Yalamanchili, Georgia Tech # Weikuan Yu, Auburn University If you have any questions about the workshop, please contact us at hpidc09-chairs at mcs.anl.gov. ==================================================================== If you do not wish to receive any more emails on this list, you can unsubscribe here: https://lists.mcs.anl.gov/mailman/listinfo/hpc-announce ==================================================================== -- Pavan Balaji http://www.mcs.anl.gov/~balaji From gus at ldeo.columbia.edu Tue Jun 30 17:06:25 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Tue, 30 Jun 2009 20:06:25 -0400 Subject: [Beowulf] Parallel Programming Question In-Reply-To: References: <428810f20906232123x4ba721aye1f4c64edec741b0@mail.gmail.com> <4A427CAA.9050304@ldeo.columbia.edu> Message-ID: <4A4AA881.7030806@ldeo.columbia.edu> Hi Bogdan, list Oh, well, this is definitely a peer reviewed list. My answers were given in the context of Amjad's original questions, and the perception, based on Amjad's previous and current postings, that he is not dealing with a large cluster, or with many users, and plans to both parallelize and update his code from F77 to F90, which can be quite an undertaking. Hence, he may want to follow the path of least resistance, rather than aim at the fanciest programming paradigm. In the edited answer that context was stripped off, and so was the description of "brute force" I/O in parallel jobs. 
That was the form of concurrent I/O I was referring to. An I/O mode which doesn't take any precautions to avoid file and network contention, and unfortunately is more common than clean, well designed, parallel I/O code (at least on the field I work). That was the form of concurrent I/O I was referring to (all processors try to do I/O at the same time using standard open/read/write/close commands provided by Fortran or another language, not MPI calls). Bogdan seems to be talking about programs with well designed parallel I/O instead. Bogdan Costescu wrote: > On Wed, 24 Jun 2009, Gus Correa wrote: > >> the "master" processor reads... broadcasts parameters that are used by >> all "slave" processors, and scatters any data that will be processed >> in a distributed fashion by each "slave" processor. >> ... >> That always works, there is no file system contention. > > I beg to disagree. There is no file system contention if this job is the > only one doing the I/O at that time, which could be the case if a job > takes the whole cluster. However, in a more conventional setup with > several jobs running at the same time, there is I/O done from several > nodes (running the MPI rank 0 of each job) at the same time, which will > still look like mostly random I/O to the storage. > Indeed, if there are 1000 jobs running, even if each one is funneling I/O through the "master" processor, there will be a large number of competing requests to the I/O system, hence contention. However, contention would also happen if all jobs were serial. Hence, this is not a problem caused by or specific from parallel jobs. It is an intrinsic limitation of the I/O system. Nevertheless, what if these 1000 jobs are running on the same cluster, but doing "brute force" I/O through each of their, say, 100 processes? Wouldn't file and network contention be larger than if the jobs were funneling I/O through a single processor? That is the context in which I made my statement. Funneling I/O through a "master" processor reduces the chances of file contention because it minimizes the number of processes doing I/O, or not? >> Another drawback is that you need to write more code for the I/O >> procedure. > > I also disagree here. The code doing I/O would need to only happen on > MPI rank 0, so no need to think for the other ranks about race > conditions, computing a rank-based position in the file, etc. > From what you wrote, you seem to agree with me on this point, not disagree. 1) Brute force I/O through all ranks takes little programming effort, the code is basically the same serial, and tends to trigger file contention (and often breaks NFS, etc). 2) Funneling I/O through the master node takes a moderate programming effort. One needs to gather/scatter data through the "master" processor, which concentrates the I/O, and reduces contention. 3) Correct and cautious parallel I/O across all ranks takes a larger programming effort, due to the considerations you pointed out above. >> In addition, MPI is in control of everything, you are less dependent >> on NFS quirks. > > ... or cluster design. I have seen several clusters which were designed > with 2 networks, a HPC one (Myrinet or Infiniband) and GigE, where the > HPC network had full bisection bandwidth, but the GigE was a heavily > over-subscribed one as the design really thought only about MPI > performance and not about I/O performance. 
In such an environment, it's > rather useless to try to do I/O simultaneously from several nodes which > share the same uplink, independent whether the storage is a single NFS > server or a parallel FS. Doing I/O from only one node would allow full > utilization of the bandwidth on the chain of uplinks to the file-server > and the data could then be scattered/gathered fast through the HPC > network. Sure, a more hardware-aware application could have been more > efficient (f.e. if it would be possible to describe the network > over-subscription so that as many uplinks could be used simultaneously > as possible), but a more balanced cluster design would have been even > better... > Absolutely, but the emphasis I've seen, at least for small clusters designed for scientific computations in a small department or research group is to pay less attention to I/O that I had the chance to know about. When one gets to the design of the filesystems and I/O the budget is already completely used up to buy a fast interconnect for MPI. I/O is then done over Gigabit Ethernet using a single NFS file server (often times a RAID on the head node itself). For the scale of a small cluster, with a few tens of nodes or so, this may work OK, as long as one writes code that is gentle with NFS (e.g. by funneling I/O through the head node). Obviously the large clusters on our national labs and computer centers do take into consideration I/O requirements, parallel file systems, etc. However, that is not my reality here, and I would guess it is not Amjad's situation either. >> [ parallel I/O programs ] always cause a problem when the number of >> processors is big. > Sorry, but I didn't say parallel I/O programs. I said brute force I/O by all processors (using standard NFS, no parallel file system, all processors banging on the file system with no coordination). > I'd also like to disagree here. Parallel file systems teach us that a > scalable system is one where the operations are split between several > units that do the work. Applying the same knowledge to the generation of > the data, a scalable application is one for which the I/O operations are > done as much as possible split between the ranks. Yes. If you have a parallel file system. > > IMHO, the "problem" that you see is actually caused by reaching the > limits of your cluster, IOW this is a local problem of that particular > cluster and not a problem in the application. By re-writing the > application to make it more NFS-friendly (f.e. like the above "rank 0 > does all I/O"), you will most likely kill scalability for another HPC > setup with a distributed/parallel storage setup. > Yes, that is true, but may only be critical if the program is I/O intensive (ours are not). One may still fare well with funneling I/O through one or a few processors, if the program is not I/O intensive, and not compromise scalability. The opposite, however, i.e., writing the program expecting the cluster to provide a parallel file system, is unlikely to scale well on a cluster without one, or not? >> Often times these codes were developed on big iron machines, ignoring >> the hurdles one has to face on a Beowulf. > > Well, the definition of Beowulf is quite fluid. Nowadays is sufficiently > easy to get a parallel FS running with commodity hardware that I > wouldn't associate it anymore with big iron. > That is true, but very budget dependent. 
>> Often times these codes were developed on big iron machines, ignoring
>> the hurdles one has to face on a Beowulf.
>
> Well, the definition of Beowulf is quite fluid. Nowadays is sufficiently
> easy to get a parallel FS running with commodity hardware that I
> wouldn't associate it anymore with big iron.

That is true, but very budget dependent. If you are on a shoestring budget, your goal is to do parallel computing, and your applications are not particularly I/O intensive, what would you prioritize: a fast interconnect for MPI, or hardware and software for a parallel file system?

>> In general they don't use MPI parallel I/O either
>
> Being on the teaching side in a recent course+practical work involving
> parallel I/O, I've seen computer science and physics students making
> quite easily the transition from POSIX I/O done on a shared file system
> to MPI-I/O. They get sometimes an index wrong, but mostly the conversion
> is painless. After that, my impression has become that it's mostly
> lazyness and the attitude 'POSIX is everywhere anywhere, why should I
> bother with something that might be missing' that keeps applications at
> this stage.

I agree with your considerations about laziness and POSIX inertia. However, there is still a long way to go before programs and programmers at least consider the restrictions imposed by networks and file systems, let alone use proper parallel I/O. Hopefully courses like yours will improve this. If I could, I would love to go to Heidelberg and take your class myself!

Regards,
Gus Correa

From amjad11 at gmail.com Tue Jun 30 21:55:51 2009
From: amjad11 at gmail.com (amjad ali)
Date: Wed, 1 Jul 2009 09:55:51 +0500
Subject: [Beowulf] Parallel Programming Question
In-Reply-To: <4A4AA881.7030806@ldeo.columbia.edu>
References: <428810f20906232123x4ba721aye1f4c64edec741b0@mail.gmail.com> <4A427CAA.9050304@ldeo.columbia.edu> <4A4AA881.7030806@ldeo.columbia.edu>
Message-ID: <428810f20906302155v1d2b3b55s65f13be8a8964e42@mail.gmail.com>

Hi, Gus--thank you. You are right. I mainly have to run programs on a small cluster (GigE) dedicated to my job only, and sometimes I may get the opportunity to run my code on a shared cluster with hundreds of nodes.

My parallel CFD application involves (in its simplest form):

1) reading of input and mesh data from a few files by the master process (I/O);
2) sending the full/respective data to all other processes (MPI broadcast/scatter);
3) sharing the values at the boundaries of the subdomains (created by METIS) at the end of each iteration (MPI message passing);
4) on convergence, sending/gathering the results from all processes to the master process;
5) writing the results to files by the master process (I/O).

So I think my program is not I/O intensive, and funneling the I/O through the master process should be sufficient for me. Right?

But now I have to parallelize a new serial code, which plots the results while running (an online/live display). That is, it shows plots of three or four variables (in small windows) while running, and we watch the whole progress, from the initial stage to the final stage, like a video. I assume that this time much more I/O is involved. At the end of each iteration the results need to be gathered from all processes onto the master process, and possibly need to be written to files as well (I am not sure). Do we need to write them to one or more files for the online display after each iteration/time step? I think that, whereas the serial code displays the results after every iteration/time step, in my parallel version I should display them only every 100 iterations/time steps, so that less I/O, and less funneling of I/O through the master process, is required. Any opinion/suggestion?

regards,
Amjad Ali.
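[Illustration, not part of the original posts: the "display only every 100 iterations" idea in the message above amounts to a loop structure like the following, again funneled through the master process. C is used for brevity; the interval, the array sizes, and the frame-file naming are placeholders, not details of Amjad's code.]

/* Sketch: gather and output the field only every PLOT_INTERVAL steps. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define NLOCAL        1000    /* illustrative per-rank block size   */
#define MAX_STEPS    10000
#define PLOT_INTERVAL  100    /* write/plot once per 100 time steps */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double local[NLOCAL] = {0.0};
    double *global = NULL;
    if (rank == 0)
        global = malloc((size_t)NLOCAL * nprocs * sizeof(double));

    for (int step = 1; step <= MAX_STEPS; step++) {
        /* ... advance the local subdomain and exchange boundary values ... */

        if (step % PLOT_INTERVAL == 0) {
            /* only now pay the cost of collecting and writing the field */
            MPI_Gather(local, NLOCAL, MPI_DOUBLE,
                       global, NLOCAL, MPI_DOUBLE, 0, MPI_COMM_WORLD);
            if (rank == 0) {
                char name[64];
                snprintf(name, sizeof name, "frame_%06d.dat", step);
                FILE *fp = fopen(name, "w");
                if (!fp) MPI_Abort(MPI_COMM_WORLD, 1);
                for (int i = 0; i < NLOCAL * nprocs; i++)
                    fprintf(fp, "%g\n", global[i]);
                fclose(fp);
            }
        }
    }

    if (rank == 0) free(global);
    MPI_Finalize();
    return 0;
}

Whether rank 0 writes frame files for an external plotting tool or feeds a plotting window directly, the gather-and-output cost is paid once every PLOT_INTERVAL steps instead of every step.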
On Wed, Jul 1, 2009 at 5:06 AM, Gus Correa wrote:

> Hi Bogdan, list > Oh, well, this is definitely a peer reviewed list. > My answers were given in the context of Amjad's original > questions, and the perception, based on Amjad's previous > and current postings, that he is not dealing with a large cluster, > or with many users, and plans to both parallelize and update his code from > F77 to F90, which can be quite an undertaking. > Hence, he may want to follow the path of least resistance, > rather than aim at the fanciest programming paradigm. > > In the edited answer that context was stripped off, > and so was the description of > "brute force" I/O in parallel jobs. > That was the form of concurrent I/O I was referring to. > An I/O mode which doesn't take any precautions > to avoid file and network contention, and unfortunately is more common > than clean, well designed, parallel I/O code (at least on the field I > work). > > That was the form of concurrent I/O I was referring to > (all processors try to do I/O at the same time using standard > open/read/write/close commands provided by Fortran or another language, > not MPI calls). > > Bogdan seems to be talking about programs with well designed parallel I/O > instead. > > Bogdan Costescu wrote: > >> On Wed, 24 Jun 2009, Gus Correa wrote: >> >> the "master" processor reads... broadcasts parameters that are used by >>> all "slave" processors, and scatters any data that will be processed in a >>> distributed fashion by each "slave" processor. >>> ... >>> That always works, there is no file system contention. >>> >> >> I beg to disagree. There is no file system contention if this job is the >> only one doing the I/O at that time, which could be the case if a job takes >> the whole cluster. However, in a more conventional setup with several jobs >> running at the same time, there is I/O done from several nodes (running the >> MPI rank 0 of each job) at the same time, which will still look like mostly >> random I/O to the storage. >> >> > Indeed, if there are 1000 jobs running, > even if each one is funneling I/O through > the "master" processor, there will be a large number of competing > requests to the I/O system, hence contention. > However, contention would also happen if all jobs were serial. > Hence, this is not a problem caused by or specific from parallel jobs. > It is an intrinsic limitation of the I/O system. > > Nevertheless, what if these 1000 jobs are running on the same cluster, > but doing "brute force" I/O through > each of their, say, 100 processes? > Wouldn't file and network contention be larger than if the jobs were > funneling I/O through a single processor? > > That is the context in which I made my statement. > Funneling I/O through a "master" processor reduces the chances of file > contention because it minimizes the number of processes doing I/O, > or not? > > Another drawback is that you need to write more code for the I/O >>> procedure. >>> >> >> I also disagree here. The code doing I/O would need to only happen on MPI >> rank 0, so no need to think for the other ranks about race conditions, >> computing a rank-based position in the file, etc. >> >> > From what you wrote, > you seem to agree with me on this point, not disagree. > > 1) Brute force I/O through all ranks takes little programming effort, > the code is basically the same serial, > and tends to trigger file contention (and often breaks NFS, etc). > 2) Funneling I/O through the master node takes a moderate programming > effort. One needs to gather/scatter data through the "master" processor, > which concentrates the I/O, and reduces contention.
> 3) Correct and cautious parallel I/O across all ranks takes a larger > programming effort, > due to the considerations you pointed out above. > > In addition, MPI is in control of everything, you are less dependent on >>> NFS quirks. >>> >> >> ... or cluster design. I have seen several clusters which were designed >> with 2 networks, a HPC one (Myrinet or Infiniband) and GigE, where the HPC >> network had full bisection bandwidth, but the GigE was a heavily >> over-subscribed one as the design really thought only about MPI performance >> and not about I/O performance. In such an environment, it's rather useless >> to try to do I/O simultaneously from several nodes which share the same >> uplink, independent whether the storage is a single NFS server or a parallel >> FS. Doing I/O from only one node would allow full utilization of the >> bandwidth on the chain of uplinks to the file-server and the data could then >> be scattered/gathered fast through the HPC network. Sure, a more >> hardware-aware application could have been more efficient (f.e. if it would >> be possible to describe the network over-subscription so that as many >> uplinks could be used simultaneously as possible), but a more balanced >> cluster design would have been even better... >> >> > Absolutely, but the emphasis I've seen, at least for small clusters > designed for scientific computations in a small department or research > group is to pay less attention to I/O that I had the chance to know about. > When one gets to the design of the filesystems and I/O the budget is > already completely used up to buy a fast interconnect for MPI. > I/O is then done over Gigabit Ethernet using a single NFS > file server (often times a RAID on the head node itself). > For the scale of a small cluster, with a few tens of nodes or so, > this may work OK, > as long as one writes code that is gentle with NFS > (e.g. by funneling I/O through the head node). > > Obviously the large clusters on our national labs and computer centers > do take into consideration I/O requirements, parallel file systems, etc. > However, that is not my reality here, and I would guess it is > not Amjad's situation either. > > [ parallel I/O programs ] always cause a problem when the number of >>> processors is big. >>> >> >> > Sorry, but I didn't say parallel I/O programs. > I said brute force I/O by all processors (using standard NFS, > no parallel file system, all processors banging on the file system > with no coordination). > > I'd also like to disagree here. Parallel file systems teach us that a >> scalable system is one where the operations are split between several units >> that do the work. Applying the same knowledge to the generation of the data, >> a scalable application is one for which the I/O operations are done as much >> as possible split between the ranks. >> > > Yes. > If you have a parallel file system. > > >> IMHO, the "problem" that you see is actually caused by reaching the limits >> of your cluster, IOW this is a local problem of that particular cluster and >> not a problem in the application. By re-writing the application to make it >> more NFS-friendly (f.e. like the above "rank 0 does all I/O"), you will most >> likely kill scalability for another HPC setup with a distributed/parallel >> storage setup. >> >> Yes, that is true, but may only be critical if the program is I/O > intensive (ours are not). 
> One may still fare well with funneling I/O through one or a few > processors, if the program is not I/O intensive, > and not compromise scalability. > > The opposite, however, i.e., > writing the program expecting the cluster to > provide a parallel file system, > is unlikely to scale well on a cluster > without one, or not? > > Often times these codes were developed on big iron machines, ignoring the >>> hurdles one has to face on a Beowulf. >>> >> >> Well, the definition of Beowulf is quite fluid. Nowadays is sufficiently >> easy to get a parallel FS running with commodity hardware that I wouldn't >> associate it anymore with big iron. >> >> > That is true, but very budget dependent. > If you are on a shoestring budget, and your goal is to do parallel > computing, and your applications are not particularly I/O intensive, > what would you prioritize: a fast interconnect for MPI, > or hardware and software for a parallel file system? > > In general they don't use MPI parallel I/O either >>> >> >> Being on the teaching side in a recent course+practical work involving >> parallel I/O, I've seen computer science and physics students making quite >> easily the transition from POSIX I/O done on a shared file system to >> MPI-I/O. They get sometimes an index wrong, but mostly the conversion is >> painless. After that, my impression has become that it's mostly lazyness and >> the attitude 'POSIX is everywhere anywhere, why should I bother with >> something that might be missing' that keeps applications at this stage. >> >> > I agree with your considerations about laziness and the POSIX-inertia. > However, there is still a long way to make programs and programmers > at least consider the restrictions imposed by network and file systems, > not to mention to use proper parallel I/O. > Hopefully courses like yours will improve this. > If I could, I would love to go to Heidelberg and take your class myself! > > Regards, > Gus Correa > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf >
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From karlanmitchell at gmail.com Mon Jun 29 11:26:19 2009
From: karlanmitchell at gmail.com (Karlan T. Mitchell)
Date: Mon, 29 Jun 2009 11:26:19 -0700
Subject: [Beowulf] Sell a U41 rack w/ computers and switches
Message-ID: <2f39fbb60906291126o1054bf9dg5541a7f88b90b458@mail.gmail.com>

Hello everyone,

I helped a friend get his cluster up and running, but when he found out how much it would cost in electricity and maintenance he became fond of the idea of just selling the thing. It's a U41 AMAX rack with 14 blades, each with four 1.6 GHz processors, 4 GB of RAM, and 250 GB hard disks. Two heavy-duty UPSs, three switches, everything gigabit, and controlled by a blade-style monitor/keyboard/mouse. Basically it is good to go. I have written software which PXE-boots an operating system on all nodes except the first, making maintenance easy for anyone who knows what they are doing (my website has more details). I would put this guy on eBay, but considering we needed a forklift to move it in the first place, it would be great to find someone in the San Francisco Bay Area to avoid a shipping nightmare.

Thanks,
- Karlan Thomas Mitchell
- http://3dstoneage.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From p2pnve at gmail.com Tue Jun 30 17:32:13 2009
From: p2pnve at gmail.com (p2pnve)
Date: Wed, 1 Jul 2009 08:32:13 +0800
Subject: [Beowulf] [CFP][Two_week reminder] P2PNVE_2009
In-Reply-To: <8d5441200906301726k777553fel7e931603ca6a249a@mail.gmail.com>
References: <8d5441200906301720j66a68ae2sae5660ae2debfa86@mail.gmail.com> <8d5441200906301726k777553fel7e931603ca6a249a@mail.gmail.com>
Message-ID: <8d5441200906301732r35ca209et1be42f4f3dba5602@mail.gmail.com>

Dear Colleagues,

Please help distribute the CFP and consider submitting one or more papers to the 3rd International Workshop on Peer-to-Peer Networked Virtual Environments (P2PNVE 2009), which will be held on December 9-11, 2009, in conjunction with the 15th International Conference on Parallel and Distributed Systems (ICPADS 2009), Shenzhen, China. Thank you very much for your help.

Best wishes,
P2PNVE 2009 Program Committee

================ CALL FOR PAPERS =================
P2PNVE 2009
The 3rd International Workshop on Peer-to-Peer Networked Virtual Environments
in conjunction with
The 15th International Conference on Parallel and Distributed Systems (ICPADS 2009)
December 9-11, 2009, Shenzhen, China
http://acnlab.csie.ncu.edu.tw/P2PNVE2009
=================================================

About P2PNVE 2009

The rapid growth and popularity of networked virtual environments (NVEs) such as Massively Multiplayer Online Games (MMOGs) in recent years have spawned a series of research interests in constructing large-scale virtual environments. To increase scalability and decrease the cost of management and deployment, more and more studies propose using peer-to-peer (P2P) architectures to construct large-scale NVEs for games, multimedia virtual worlds, and other applications. The goal of such research is to support an Earth-scale virtual environment or to make hosting virtual worlds more affordable than existing client-server approaches. However, existing solutions for consistency control, persistent data storage, multimedia data dissemination, cheat prevention, topology mismatching, and virtual world interoperability do not adapt straightforwardly to such new environments. Novel ideas and designs are thus needed to realize the potential of P2P-based NVEs.

The 1st and 2nd International Workshops on Peer-to-Peer Networked Virtual Environments were held in conjunction with the 13th and 14th International Conference on Parallel and Distributed Systems in 2007 and 2008, respectively. In keeping with the theme of the P2PNVE workshops, P2PNVE 2009 solicits original and previously unpublished ideas on general P2P schemes and on the design and realization of P2P-based NVEs. The workshop aims to facilitate discussions and idea exchanges by both academics and practitioners.
Topics of interest include, but are not limited to:

- P2P systems and infrastructures
- Applications of P2P systems
- Performance evaluation of P2P systems
- Trust and security issues in P2P systems
- Network support for P2P systems
- Fault tolerance in P2P systems
- Efficient P2P resource lookup and sharing
- Distributed Hash Tables (DHTs) and related issues
- Solutions to topology mismatching for P2P overlays
- P2P overlays for NVEs
- P2P NVE multicast
- P2P NVE interoperability
- P2P NVE content distribution
- P2P NVE 3D streaming
- P2P NVE voice communications
- P2P NVE architecture designs
- P2P NVE prototypes
- P2P NVE consistency control
- Persistent storage for P2P NVEs
- Security and cheat-prevention mechanisms for P2P games
- P2P control for mobile NVEs
- P2P NVE applications on mobile devices

Important Dates
Submission: July 15, 2009
Notification: September 1, 2009
Camera ready: October 1, 2009

Paper Submission
Authors are invited to submit an electronic version of original, unpublished manuscripts, not to exceed 8 double-column, single-spaced pages, to the web site http://acnlab.csie.ncu.edu.tw/P2PNVE2009. Submitted papers should be in PDF format, in accordance with IEEE Computer Society guidelines (Word or LaTeX). All submitted papers will be refereed by reviewers in terms of originality, contribution, correctness, and presentation.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: