From gmpc at sanger.ac.uk Tue Jun 1 05:03:15 2010 From: gmpc at sanger.ac.uk (Guy Coates) Date: Tue, 01 Jun 2010 13:03:15 +0100 Subject: [Beowulf] cluster scheduler for dynamic tree-structured jobs? In-Reply-To: <20100515102454.GA99295@piskorski.com> References: <20100515102454.GA99295@piskorski.com> Message-ID: <4C04F703.9010700@sanger.ac.uk> On 15/05/10 11:24, Andrew Piskorski wrote: > Folks, I could use some advice on which cluster job scheduler (batch > queuing system) would be most appropriate for my particular needs. > I've looked through docs for SGE, Slurm, etc., but without first-hand > experience with each one it's not at all clear to me which I should > choose... > This may be late in the day but... If you job dependencies are too complicated for you queuing system to deal with, you may want to look at the Ensembl Hive system; http://www.ensembl.org/info/docs/eHive/index.html It is the system we use in-house for our genome-analysis pipelines, which have lots of complicated dependencies. It sits on top of a traditional queuing system which handles job-dispatch etc. It has been de-coupled from the genome analysis workflow, so you should (in theory) be able to use it for any analysis. Cheers, Guy -- Dr. Guy Coates, Informatics System Group The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK Tel: +44 (0)1223 834244 x 6925 Fax: +44 (0)1223 496802 -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From rreis at aero.ist.utl.pt Tue Jun 1 23:44:41 2010 From: rreis at aero.ist.utl.pt (Ricardo Reis) Date: Wed, 2 Jun 2010 07:44:41 +0100 (WEST) Subject: [Beowulf] Top 500 in the BBC Message-ID: http://infosthetics.com/archives/2010/05/bbc_news_visualizing_the_top_500_supercomputer_report.html best regards, Ricardo Reis 'Non Serviam' PhD candidate @ Lasef Computational Fluid Dynamics, High Performance Computing, Turbulence http://www.lasef.ist.utl.pt Cultural Instigator @ R?dio Zero http://www.radiozero.pt Keep them Flying! Ajude a/help Aero F?nix! http://www.aeronauta.com/aero.fenix http://www.flickr.com/photos/rreis/ < sent with alpine 2.00 > From sabujp at gmail.com Thu Jun 3 20:01:23 2010 From: sabujp at gmail.com (Sabuj Pattanayek) Date: Thu, 3 Jun 2010 22:01:23 -0500 Subject: [Beowulf] recommendations for parallel IO In-Reply-To: References: Message-ID: > ?We need an open source solution, we are looking into PVFS and Gluster (but > from what we see, Gluster doesn't quit fit the bill? It's more a distributed > filesystem than a parallel filesystem... or are we taking the wrong turn on > our reasoning, somewhere about this?) gluster in stripe mode is parallel. It can also be distributed or distributed and parallel, mirrored, etc. From pal at di.fct.unl.pt Fri Jun 4 04:45:05 2010 From: pal at di.fct.unl.pt (Paulo Afonso Lopes) Date: Fri, 4 Jun 2010 12:45:05 +0100 (WEST) Subject: [Beowulf] recommendations for parallel IO In-Reply-To: References: Message-ID: <61056.193.136.122.17.1275651905.squirrel@webmail.fct.unl.pt> Oi, Ricardo. > > Hi all > > We have a small cluster but some users need to use MPI-IO. We have a > NFS3 > shared partition but you would need to mount it with special options who > would hurt performance. Yes... 
options include all the available ways to enforce "no client caching" and that is (usually) very bad for performance :-) There's also NFS4.1 but I can't speak about it other than the last time (> 6 months) I looked, it was VERY OS dependent (you had to run kernel 2.6.x.y.z); furthermore, I haven't looked at the MPI-IO support status on 4.1. > We are looking into a nice parallel file system to > deploy in this context. We got 4 boxes with a 500Gb disk in each, for the Are the 4 boxes just for the filesystem service, or are they "the small cluster" ? > moment, connected with Gb. We have another Gb connection dedicated to the > MPI traffic. > > We need an open source solution, we are looking into PVFS I am using it and I have very good experiences with PVFS: easy installation, support -- !excellent! -- and good performance; the only minuses are 1) CPU use when you used GbE "dumb" cards (those usually integrated in the mobo) and 2) some limitations on the POSIX interface (which should not hurt you, as you're going the MPI-IO way). If the 4 nodes are both the compute and I/O nodes, then (1) above will hurt your applications iff they overlap I/O and computation. > and Gluster > (but from what we see, Gluster doesn't quit fit the bill? It's more a > distributed filesystem than a parallel filesystem... or are we taking the > wrong turn on our reasoning, somewhere about this?) > Yes, you're right, me thinks :-) However, I have no experience You could also look at Lustre. Back in the days (of CFS, Inc.) where there were 2 versions, the free and the not-free, the free was a nightmare to install (been there), had quite a few bugs, and was always months behind the non-free, but I am told that when Sun picked it they changed that, and there is only one version (altough there are some mumbles about Oracle's doing this and that) Abra?o paulo -- Paulo Afonso Lopes | Tel: +351- 21 294 8536 Departamento de Inform?tica | 294 8300 ext.10702 Faculdade de Ci?ncias e Tecnologia | Fax: +351- 21 294 8541 Universidade Nova de Lisboa | e-mail: poral at fct.unl.pt 2829-516 Caparica, PORTUGAL From gus at ldeo.columbia.edu Fri Jun 4 09:43:35 2010 From: gus at ldeo.columbia.edu (Gus Correa) Date: Fri, 04 Jun 2010 12:43:35 -0400 Subject: [Beowulf] Top 500 in the BBC In-Reply-To: References: Message-ID: <4C092D37.9060303@ldeo.columbia.edu> Ola' Ricardo, That's really nice! Here's the BBC link also: http://news.bbc.co.uk/2/hi/10187248.stm Na~o ha' nada como o ra'dio para a difusa~o da informaca~o! Abrac,o Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- Ricardo Reis wrote: > > http://infosthetics.com/archives/2010/05/bbc_news_visualizing_the_top_500_supercomputer_report.html > > > best regards, > > Ricardo Reis > > 'Non Serviam' > > PhD candidate @ Lasef > Computational Fluid Dynamics, High Performance Computing, Turbulence > http://www.lasef.ist.utl.pt > > Cultural Instigator @ R?dio Zero > http://www.radiozero.pt > > Keep them Flying! Ajude a/help Aero F?nix! 
> > http://www.aeronauta.com/aero.fenix > > http://www.flickr.com/photos/rreis/ > > < sent with alpine 2.00 > > > > ------------------------------------------------------------------------ > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mathog at caltech.edu Tue Jun 8 10:44:55 2010 From: mathog at caltech.edu (David Mathog) Date: Tue, 08 Jun 2010 10:44:55 -0700 Subject: [Beowulf] OT: recoverable optical media archive format? Message-ID: This is off topic so I will try to keep it short: is there an "archival" format for large binary files which contains enough error correction to that all original data may be recovered even if there is a little data loss in the storage media? For my purposes these are disk images, sometimes .tar.gz, other times gunzip -c of dd dumps of whole partitions which have been "cleared" by filling the empty space with one big file full of zero, and then that file deleted. I'm thinking of putting this information on DVD's (only need to keep it for a few years at a time) but I don't trust that media not to lose a sector here or there - having watched far too many scratched DVD movies with playback problems. Unlike an SDLT with a bad section, the good parts of a DVD are still readable when there is a bad block (using dd or ddrescue) but of course even a single missing chunk makes it impossible to decompress a .gz file correctly. So what I'm looking for is some sort of .img.gz.ecc format, where the .ecc puts in enough redundant information to recover the underlying img.gz even when sectors or data are missing. If no such tool/format exists then two copies should be enough to recover all of an .img.gz so long as the same data wasn't lost on both media, and if bad DVD sectors always come back as "failed read", never ever showing up as a good read but actually containing bad data. Perhaps the frame checksum on a DVD is enough to guarantee that? Thanks, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From mdidomenico4 at gmail.com Tue Jun 8 11:05:19 2010 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Tue, 8 Jun 2010 14:05:19 -0400 Subject: [Beowulf] OT: recoverable optical media archive format? In-Reply-To: References: Message-ID: What's the ramification of losing a block? (ie file-system won't mount, data has a hole) Not that it's elegant, the first thing that pops to mind is using 'split' to chunk the file into many little bits and then md5 each bit On Tue, Jun 8, 2010 at 1:44 PM, David Mathog wrote: > This is off topic so I will try to keep it short: ?is there an > "archival" format for large binary files which contains enough error > correction to that all original data may be recovered even if there is a > little data loss in the storage media? > > For my purposes these are disk images, sometimes .tar.gz, other times > gunzip -c of dd dumps of whole partitions which have been "cleared" by > filling the empty space with one big file full of zero, and then that > file deleted. ?I'm thinking of putting this information on DVD's (only > need to keep it for a few years at a time) but I don't trust that media > not to lose a sector here or there - having watched far too many > scratched DVD movies with playback problems. 
> > Unlike an SDLT with a bad section, the good parts of a DVD are still > readable when there is a bad block (using dd or ddrescue) but of course > even a single missing chunk makes it impossible to decompress a .gz file > correctly. ?So what I'm looking for is some sort of .img.gz.ecc format, > where the .ecc puts in enough redundant information to recover the > underlying img.gz even when sectors or data are missing. ? If no such > tool/format exists then two copies should be enough to recover all of an > .img.gz so long as the same data wasn't lost on both media, and if bad > DVD sectors always come back as "failed read", never ever showing up as > a good read but actually containing bad data. ?Perhaps the frame > checksum on a DVD is enough to guarantee that? > > Thanks, > > David Mathog > mathog at caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From reuti at staff.uni-marburg.de Tue Jun 8 12:03:59 2010 From: reuti at staff.uni-marburg.de (Reuti) Date: Tue, 8 Jun 2010 21:03:59 +0200 Subject: [Beowulf] OT: recoverable optical media archive format? In-Reply-To: References: Message-ID: <230D2E83-0269-4512-B9F1-54D273327888@staff.uni-marburg.de> Hi, Am 08.06.2010 um 19:44 schrieb David Mathog: > This is off topic so I will try to keep it short: is there an > "archival" format for large binary files which contains enough error > correction to that all original data may be recovered even if there > is a > little data loss in the storage media? > > For my purposes these are disk images, sometimes .tar.gz, other times > gunzip -c of dd dumps of whole partitions which have been "cleared" by > filling the empty space with one big file full of zero, and then that > file deleted. I'm thinking of putting this information on DVD's (only > need to keep it for a few years at a time) but I don't trust that > media > not to lose a sector here or there - having watched far too many > scratched DVD movies with playback problems. > > Unlike an SDLT with a bad section, the good parts of a DVD are still > readable when there is a bad block (using dd or ddrescue) but of > course > even a single missing chunk makes it impossible to decompress a .gz > file > correctly. So what I'm looking for is some sort of .img.gz.ecc > format, > where the .ecc puts in enough redundant information to recover the > underlying img.gz even when sectors or data are missing. If no such > tool/format exists then two copies should be enough to recover all > of an > .img.gz so long as the same data wasn't lost on both media, and if bad > DVD sectors always come back as "failed read", never ever showing up > as > a good read but actually containing bad data. Perhaps the frame > checksum on a DVD is enough to guarantee that? besides splitting the file, I would suggest to generate some par/par2 files. This format was originally used on the Usene, to have a reliable way to transfer binary attachements. I.e. first you split your files into e.g. 10 pieces each and generate 5 par/par2 files for each of them. Then you need any 10 out of these 15 into total to be good to recover the original file. 
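For example, a minimal sketch of that workflow with the par2cmdline
tools might look like this (the file names, the piece size and the 30%
redundancy figure are only placeholders, not anything prescribed):

  split -b 100M backup.tar.gz backup.tar.gz.     # cut the archive into pieces
  par2 create -r30 backup.par2 backup.tar.gz.*   # ~30% recovery data
  # ... burn the pieces plus the .par2/.vol files; later, after copying
  # everything back off the disc:
  par2 verify backup.par2
  par2 repair backup.par2                        # rebuild damaged/missing pieces
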
http://en.wikipedia.org/wiki/Parchive -- Reuti > Thanks, > > David Mathog > mathog at caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From beckerjes at mail.nih.gov Tue Jun 8 17:49:57 2010 From: beckerjes at mail.nih.gov (Jesse Becker) Date: Tue, 8 Jun 2010 20:49:57 -0400 Subject: [Beowulf] OT: recoverable optical media archive format? In-Reply-To: References: Message-ID: <20100609004957.GM26589@mail.nih.gov> I came across this page a few years back that discusses this very problem: http://users.softlab.ntua.gr/~ttsiod/rsbep.html On Tue, Jun 08, 2010 at 01:44:55PM -0400, David Mathog wrote: >This is off topic so I will try to keep it short: is there an >"archival" format for large binary files which contains enough error >correction to that all original data may be recovered even if there is a >little data loss in the storage media? > >For my purposes these are disk images, sometimes .tar.gz, other times >gunzip -c of dd dumps of whole partitions which have been "cleared" by >filling the empty space with one big file full of zero, and then that >file deleted. I'm thinking of putting this information on DVD's (only >need to keep it for a few years at a time) but I don't trust that media >not to lose a sector here or there - having watched far too many >scratched DVD movies with playback problems. > >Unlike an SDLT with a bad section, the good parts of a DVD are still >readable when there is a bad block (using dd or ddrescue) but of course >even a single missing chunk makes it impossible to decompress a .gz file >correctly. So what I'm looking for is some sort of .img.gz.ecc format, >where the .ecc puts in enough redundant information to recover the >underlying img.gz even when sectors or data are missing. If no such >tool/format exists then two copies should be enough to recover all of an >.img.gz so long as the same data wasn't lost on both media, and if bad >DVD sectors always come back as "failed read", never ever showing up as >a good read but actually containing bad data. Perhaps the frame >checksum on a DVD is enough to guarantee that? > >Thanks, > >David Mathog >mathog at caltech.edu >Manager, Sequence Analysis Facility, Biology Division, Caltech >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Jesse Becker NHGRI Linux support (Digicon Contractor) From kilian.cavalotti.work at gmail.com Wed Jun 9 00:33:16 2010 From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI) Date: Wed, 9 Jun 2010 09:33:16 +0200 Subject: [Beowulf] OT: recoverable optical media archive format? In-Reply-To: References: Message-ID: On Tue, Jun 8, 2010 at 8:05 PM, Michael Di Domenico wrote: > Not that it's elegant, the first thing that pops to mind is using > 'split' to chunk the file into many little bits and then md5 each bit While this may let you know that a file has been corrupted, it won't help recovering that file. Some compression algorithms, which may be considered as storage algorithms if you turn compression off, have options to create recovery records. 
For instance, in the RAR format (http://en.wikipedia.org/wiki/RAR), you can choose how much redundant data you want to include in your archive (whose size will be increased accordingly). Excerpt from Alexander Roshal's rar user's manual: """ rr[N] Add data recovery record. Optionally, redundant information (recovery record) may be added to an archive. This will cause a small increase of the archive size and helps to recover archived files in case of floppy disk failure or data losses of any other kind. A recovery record contains up to 524288 recovery sectors. The number of sectors may be specified directly in the 'rr' command (N = 1, 2 .. 524288) or, if it is not specified by the user, it will be selected automatically according to the archive size: a size of the recovery information will be about 1% of the total archive size, usually allowing the recovery of up to 0.6% of the total archive size of continuously damaged data. It is also possible to specify the recovery record size in percent to the archive size. Just append the percent character to the command parameter. For example: rar rr3% arcname If data is damaged continuously, then each rr-sector helps to recover 512 bytes of damaged information. This value may be lower in cases of multiple damage. """ Cheers, -- Kilian From mathog at caltech.edu Thu Jun 10 12:20:39 2010 From: mathog at caltech.edu (David Mathog) Date: Thu, 10 Jun 2010 12:20:39 -0700 Subject: [Beowulf] OT: recoverable optical media archive format? Message-ID: Jesse Becker and others suggested: > http://users.softlab.ntua.gr/~ttsiod/rsbep.html I tried it and it works, mostly, but definitely has some warts. To start with I gave it a negative control - a file so badly corrupted it should NOT have been able to recover it. % ssh remotePC 'dd if=/dev/sda1 bs=8192' >img.orig % cat img.orig | bzip2 >img.bz2.orig % cat img.bz2.orig | rsbep > img.bz2.rsbep % cat img.bz2.rsbep | pockmark -maxgap 100000 -maxrun 10000 >img.bz2.rsbep.pox % cat img.bz2.rsbep.pox | rsbep -d -v >img.bz2.restored rsbep: number of corrected failures : 9725096 rsbep: number of uncorrectable blocks : 0 img.orig is a Windows XP partition with all empty space filled with 0x0 bytes. That is then compressed with bzip2, then run through rsbep (the one from the link above), then corrupted with pockmark. Pockmark is my own little concoction, when used as shown it stamps 0x0 bytes starting randomly every (1-MAXGAP) bytes, for a run of (1-MAXRUN). In both cases the gap and run length are chosen at random from those ranges for each new gap/run. This should corrupt around 10% of the file, which I assumed would render it unrecoverable. Notice in the file sizes below that the overall size did not change when the file was run through pockmark. rsbep did not note any errors it couldn't correct. However, the size of the restored file is not the same as the orig. 4056976560 2010-06-08 17:51 img.bz2.restored 4639143600 2010-06-08 16:19 img.bz2.rsbep.pox 4639143600 2010-06-08 16:13 img.bz2.rsbep 4056879025 2010-06-08 14:40 img.bz2.orig 20974431744 2010-06-07 15:23 img.orig % bunzip2 -tvv img.bz2.restored img.bz2.restored: [1: huff+mtf data integrity (CRC) error in data So at the very least rsbep sometimes says it has recovered a file when it has not. I didn't really expect it to rescue this particular input, but it really should have handled it better. 
I reran it with a less damaged file like this: % cat img.bz2.rsbep | pockmark -maxgap 1000000 -maxrun 10000 >img.bz2.rsbep.pox2 % cat img.bz2.rsbep.pox2 | rsbep -d -v >img.bz2.restored2 rsbep: number of corrected failures : 46025036 rsbep: number of uncorrectable blocks : 0 % bunzip2 img.bz2.restored2 bunzip2: Can't guess original name for img.bz2.restored2 -- using img.bz2.restored2.out bunzip2: img.bz2.restored2: trailing garbage after EOF ignored % md5sum img.bz2.restored2.out img.orig 7fbaec7143c3a17a31295a803641aa3c img.bz2.restored2.out 7fbaec7143c3a17a31295a803641aa3c img.orig This time it was able to recover the corrupted file, but again, it created an output file which was a different size. Is this always the case? Seems to be at least for the size file used here: % cat img.bz2.orig | rsbep | rsbep -d > nopox.bz2 nopox.bz2 is also 4056976560. The decoded output is always 97535 bytes larger than the original, which may bear some relation to the z=ERR_BURST_LEN parameter as: 97535 /765 = 127.496732 which is suspiciously close to 255/2. Or that could just be a coincidence. In any case, bunzip2 was able to handle the crud on the end, but this would have been a problem for other binary files. Tbe other thing that is frankly bizarre is the number of "corrected" failures for the 2nd case vs. the first. The 2nd should have 10X fewer bad bytes than the first, but the rsbep status messages indicate 4.73X MORE. However, the number of bad bytes in the 2nd is almost exactly 1%, as it should be. All of this suggests that rsbep does not handle correctly files which are "too" corrupted. It gives the wrong number of corrected blocks and thinks that it has corrected everything when it has not done so. Worse, even when it does work the output file was never (in any of the test cases) the same size as the input file. I think this program has potential but it needs a bit of work to sand the rough edges off. I will have a look at it, but won't have a chance to do so for a couple of weeks. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From atchley at myri.com Thu Jun 10 13:11:36 2010 From: atchley at myri.com (Scott Atchley) Date: Thu, 10 Jun 2010 16:11:36 -0400 Subject: [Beowulf] OT: recoverable optical media archive format? In-Reply-To: References: Message-ID: <5E6C3D74-439A-41EE-8CE6-FAF5CE8888C7@myri.com> On Jun 10, 2010, at 3:20 PM, David Mathog wrote: > Jesse Becker and others suggested: > >> http://users.softlab.ntua.gr/~ttsiod/rsbep.html > > I tried it and it works, mostly, but definitely has some warts. > > To start with I gave it a negative control - a file so badly corrupted > it should NOT have been able to recover it. > > % ssh remotePC 'dd if=/dev/sda1 bs=8192' >img.orig > % cat img.orig | bzip2 >img.bz2.orig > % cat img.bz2.orig | rsbep > img.bz2.rsbep > % cat img.bz2.rsbep | pockmark -maxgap 100000 -maxrun 10000 >> img.bz2.rsbep.pox > % cat img.bz2.rsbep.pox | rsbep -d -v >img.bz2.restored > rsbep: number of corrected failures : 9725096 > rsbep: number of uncorrectable blocks : 0 > > img.orig is a Windows XP partition with all empty space filled with > 0x0 bytes. That is then compressed with bzip2, then run > through rsbep (the one from the link above), then corrupted > with pockmark. Pockmark is my own little concoction, when used as > shown it stamps 0x0 bytes starting randomly every (1-MAXGAP) bytes, for > a run of (1-MAXRUN). In both cases the gap and run length are chosen at > random from those ranges for each new gap/run. 
> This should corrupt around 10% of the file, which I assumed would render > it unrecoverable. Notice in the file sizes below that the overall size > did not change when the file was run through pockmark. rsbep did not > note any errors it couldn't correct. However, the > size of the restored file is not the same as the orig. > > 4056976560 2010-06-08 17:51 img.bz2.restored > 4639143600 2010-06-08 16:19 img.bz2.rsbep.pox > 4639143600 2010-06-08 16:13 img.bz2.rsbep > 4056879025 2010-06-08 14:40 img.bz2.orig > 20974431744 2010-06-07 15:23 img.orig > > % bunzip2 -tvv img.bz2.restored > img.bz2.restored: > [1: huff+mtf data integrity (CRC) error in data > > So at the very least rsbep sometimes says it has recovered a file when > it has not. I didn't really expect it to rescue this particular input, > but it really should have handled it better. I have never used this tool, but I would wonder if your pockmark tool damaged the rsbep metadata, specifically one or more of the metadata segment lengths. Bear in mind that corruption of the metadata is not beyond the realm of possibility, but I assume that the rsbep metadata is not replicated or otherwise protected. > I reran it with a less damaged file like this: > > % cat img.bz2.rsbep | pockmark -maxgap 1000000 -maxrun 10000 >> img.bz2.rsbep.pox2 > % cat img.bz2.rsbep.pox2 | rsbep -d -v >img.bz2.restored2 > rsbep: number of corrected failures : 46025036 > rsbep: number of uncorrectable blocks : 0 > % bunzip2 img.bz2.restored2 > bunzip2: Can't guess original name for img.bz2.restored2 -- using > img.bz2.restored2.out > bunzip2: img.bz2.restored2: trailing garbage after EOF ignored > % md5sum img.bz2.restored2.out img.orig > 7fbaec7143c3a17a31295a803641aa3c img.bz2.restored2.out > 7fbaec7143c3a17a31295a803641aa3c img.orig > > This time it was able to recover the corrupted file, but again, it > created an output file which was a different size. Is this always the > case? Seems to be at least for the size file used here: > > % cat img.bz2.orig | rsbep | rsbep -d > nopox.bz2 > > nopox.bz2 is also 4056976560. The decoded output is always 97535 bytes > larger than the original, which may bear some relation to the > z=ERR_BURST_LEN parameter as: > > 97535 /765 = 127.496732 > > which is suspiciously close to 255/2. Or that could just be a coincidence. > > In any case, bunzip2 was able to handle the crud on the end, but this > would have been a problem for other binary files. This is most likely a requirement of the underlying Reed-Solomon library that requires equal length blocksizes. If your original file is N bytes and N % M != 0 where M is the blocksize, I imagine it pads the last block with 0s so that it is N bytes. It should not affect bunzip since the length is encoded in the file and it ignores anything tacked onto the end. A quick glance at his website, it claims that the length should be the same. He only shows, however, the md5sums and not the ls -l output. Scott > Tbe other thing that is frankly bizarre is the number of "corrected" > failures for the 2nd case vs. the first. The 2nd should have 10X > fewer bad bytes than the first, but the rsbep status messages > indicate 4.73X MORE. However, the number of bad bytes in the 2nd is > almost exactly 1%, as it should be. All of this suggests that rsbep > does not handle correctly files which are "too" corrupted. It gives the > wrong number of corrected blocks and thinks that it has corrected > everything when it has not done so. 
Worse, even when it does work the > output file was never (in any of the test cases) the same size as the > input file. > > I think this program has potential but it needs a bit of work to sand > the rough edges off. I will have a look at it, but won't have a chance > to do so for a couple of weeks. > > Regards, > > David Mathog > mathog at caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mathog at caltech.edu Thu Jun 10 14:43:21 2010 From: mathog at caltech.edu (David Mathog) Date: Thu, 10 Jun 2010 14:43:21 -0700 Subject: [Beowulf] OT: recoverable optical media archive format? Message-ID: Scott Atchley wrote: > I have never used this tool, but I would wonder if your pockmark tool damaged the rsbep metadata, specifically one or more of the metadata segment lengths. Bear in mind that corruption of the metadata is not beyond the realm of possibility, but I assume that the rsbep metadata is not replicated or otherwise protected. pockmark just stomps on random parts of the file, so the metadata is as open to destruction as anything else. Presumably that shouldn't be an issue for this sort of program though - the metadata should also be protected in some manner. > > In any case, bunzip2 was able to handle the crud on the end, but this > > would have been a problem for other binary files. > > This is most likely a requirement of the underlying Reed-Solomon library that requires equal length blocksizes. If your original file is N bytes and N % M != 0 where M is the blocksize, I imagine it pads the last block with 0s so that it is N bytes. It should not affect bunzip since the length is encoded in the file and it ignores anything tacked onto the end. bunzip2 id not affected, but it is not a good thing to do in general. Not all binary files will be functionally equivalent after null bytes are added on the end! > > A quick glance at his website, it claims that the length should be the same. He only shows, however, the md5sums and not the ls -l output. I forwarded my observations to the program's author and suggested that if I ran the program incorrectly, or he finds these really are bugs, that he post back here with corrections. I tried rsbep again with a test file of size 81920000 bytes (much less than 32bits unsigned, the first test file was larger in bytes than 32 bits unsigned) but similar problems arose. One difference, for the smaller test file the restored files were 240842 bytes bigger, not 97535 like before. My guess is that since the program dates back to the age of very small media it may be using "int" or "long" in locations where "long long" is needed today. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From dnlombar at ichips.intel.com Thu Jun 10 15:11:56 2010 From: dnlombar at ichips.intel.com (David N. Lombard) Date: Thu, 10 Jun 2010 15:11:56 -0700 Subject: [Beowulf] OT: recoverable optical media archive format? In-Reply-To: References: Message-ID: <20100610221155.GA30172@nlxcldnl2.cl.intel.com> On Thu, Jun 10, 2010 at 12:20:39PM -0700, David Mathog wrote: > Jesse Becker and others suggested: > > > http://users.softlab.ntua.gr/~ttsiod/rsbep.html > > I tried it and it works, mostly, but definitely has some warts. 
> > To start with I gave it a negative control - a file so badly corrupted > it should NOT have been able to recover it. > > % ssh remotePC 'dd if=/dev/sda1 bs=8192' >img.orig > % cat img.orig | bzip2 >img.bz2.orig > % cat img.bz2.orig | rsbep > img.bz2.rsbep > % cat img.bz2.rsbep | pockmark -maxgap 100000 -maxrun 10000 > >img.bz2.rsbep.pox > % cat img.bz2.rsbep.pox | rsbep -d -v >img.bz2.restored > rsbep: number of corrected failures : 9725096 > rsbep: number of uncorrectable blocks : 0 > > img.orig is a Windows XP partition with all empty space filled with > 0x0 bytes. That is then compressed with bzip2, then run > through rsbep (the one from the link above), then corrupted > with pockmark. Pockmark is my own little concoction, when used as > shown it stamps 0x0 bytes starting randomly every (1-MAXGAP) bytes, for > a run of (1-MAXRUN). In both cases the gap and run length are chosen at > random from those ranges for each new gap/run. The website is more interested in corrupted block media, with the assumption said corruption manifests as a cluster of invalid blocks from the file. You've got a different type of corruption. > % cat img.bz2.rsbep | pockmark -maxgap 1000000 -maxrun 10000 > >img.bz2.rsbep.pox2 > % cat img.bz2.rsbep.pox2 | rsbep -d -v >img.bz2.restored2 > rsbep: number of corrected failures : 46025036 > rsbep: number of uncorrectable blocks : 0 > % bunzip2 img.bz2.restored2 > bunzip2: Can't guess original name for img.bz2.restored2 -- using > img.bz2.restored2.out > bunzip2: img.bz2.restored2: trailing garbage after EOF ignored > % md5sum img.bz2.restored2.out img.orig > 7fbaec7143c3a17a31295a803641aa3c img.bz2.restored2.out > 7fbaec7143c3a17a31295a803641aa3c img.orig He documents a "freeze.sh" and "melt.sh" (in the contrib dir) that wrap rsbep and rsbep_chopper. That's very different from what you did. [dnl at closter ~]$ ls -l junk.avi -rw-rw-r-- 1 dnl dnl 12622344 2010-02-28 19:17 junk.avi [dnl at closter ~]$ freeze junk.avi > freeze1 [dnl at closter ~]$ melt freeze1 > melt1 [dnl at closter ~]$ md5sum freeze1 junk.avi melt1 4f8052c358e5bd86b9bfffd980726940 junk.avi dcbeafa75ec60f50d003876866009213 freeze1 4f8052c358e5bd86b9bfffd980726940 melt1 [dnl at closter ~]$ ls -l junk.avi freeze1 melt1 -rw-rw-r-- 1 dnl dnl 14565600 2010-06-10 14:56 freeze1 -rw-rw-r-- 1 dnl dnl 12622344 2010-02-28 19:17 junk.avi -rw-rw-r-- 1 dnl dnl 12622344 2010-06-10 14:56 melt1 [dnl at closter ~]$ The above only shows a single example of it not damaging an intact file. But I only played with it for about 10m last night and the above is a laugh test. I'll start proper testing tonight... -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. From james.p.lux at jpl.nasa.gov Thu Jun 10 16:39:00 2010 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Thu, 10 Jun 2010 16:39:00 -0700 Subject: [Beowulf] OT: recoverable optical media archive format? In-Reply-To: <20100610221155.GA30172@nlxcldnl2.cl.intel.com> References: <20100610221155.GA30172@nlxcldnl2.cl.intel.com> Message-ID: > -----Original Message----- > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of David N. Lombard > Sent: Thursday, June 10, 2010 3:12 PM > To: David Mathog > Cc: beowulf at beowulf.org > Subject: Re: [Beowulf] OT: recoverable optical media archive format? > > > The website is more interested in corrupted block media, with the assumption > said corruption manifests as a cluster of invalid blocks from the file. 
You've > got a different type of corruption. > In the communications field one uses interleaving as a way to turn burst errors into isolated errors. One then uses a short length coding scheme to detect and correct the isolated errors. On systems where the errors are sporadic (communication links with additive white Gaussian noise, or RAM that gets sporadic single bit flips), then interleaving doesn't buy you anything, and you can use short codes like Hamming or Reed-Solomon/BCH. On systems where errors are transient (read errors from flash.. read the same location again a second time and it's ok) or where you can ask for a repeat, then block oriented "go back N" codes work well. Unless you have a timing determinism requirement or the retry interval is very long (8 hour light time to Pluto), and then, some sort of redundant block scheme (send every block three times in a row) gets used. A good practical example is the coding used on CDs... the error correcting code is a Reed Solomon, which does real well at isolated errors, but not great at burst errors. So they use a R-S code with a block interleave scheme in front of it and then another R-S code, because the error statistics from CD drives show burst errors. So it looks like you are looking for the inverse.. something that turns distributed errors into clumps? From Tina.Friedrich at diamond.ac.uk Fri Jun 4 01:39:38 2010 From: Tina.Friedrich at diamond.ac.uk (Tina Friedrich) Date: Fri, 04 Jun 2010 09:39:38 +0100 Subject: [Beowulf] Re: Bugfix for Broadcom NICs losing connectivity In-Reply-To: <20100525194056.GB16022@kaizen.mayo.edu> References: <201005251900.o4PJ0ElP016422@bluewest.scyld.com> <20100525194056.GB16022@kaizen.mayo.edu> Message-ID: <4C08BBCA.5070508@diamond.ac.uk> We've had that happen on some of our servers. Currently using the disable_msi workaround, which seems to have stopped it. I believe there's supposed to be a fix in the latest Red Hat kernel but we haven't really tested that yet. You loose all network connectivity (including IPMI) to the server - not all connectivity, so e.g. serial console (not SOL, proper serial console, or using a console server) still works (as would a locally attached keyboard/monitor). Unless you require network to log in :) . If one runs into this, it's a really weird one (before you find the bug report) - to all appearances, the server works happily, no strangeness in the logs - just network gone completely. It's not one to trigger easily - hard to track down sort of thing. Had 610s and 710s for a while before this first happened (and loads we never saw it on, still). We first saw it on a rather heavily used NFS server (i.e. lots of network I/O). Tina Cris Rhea wrote: >> In case it helps anyone using Dell R410 / 610 / 710 etc. servers: I have had >> machines lose their eth connections periodically (CentOS 5.4 bnx2 driver). >> Seems like a bug with the Broadcom NIC drivers. [luckily read of it on a >> Dell mailing list] >> >> Bug Reports: >> >> http://kbase.redhat.com/faq/docs/DOC-26837 >> http://patchwork.ozlabs.org/patch/51106 >> >> Not sure yet if this is exactly my issue but I'm giving it a shot now. >> Thought I'd post since, anecdotally I've seen many people use these servers >> on the list. >> >> -- >> Rahul > > I've been following this on the Dell list as I have approx. 50 R410s > in our cluster. > > One thing that isn't clear-- When this happens, do you lose all > connectivity to the node (i.e., do you have to reboot the node to > re-establish eth0)? 
> > My R410s are running CentOS 5.2 - 5.4 and I rarely have one go > down. > > --- Cris > > -- Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd Diamond House, Harwell Science and Innovation Campus - 01235 77 8442 From mb at gup.jku.at Tue Jun 8 11:31:17 2010 From: mb at gup.jku.at (Markus Baumgartner) Date: Tue, 08 Jun 2010 20:31:17 +0200 Subject: [Beowulf] OT: recoverable optical media archive format? In-Reply-To: References: Message-ID: <4C0E8C75.8020609@gup.jku.at> David Mathog wrote: > This is off topic so I will try to keep it short: is there an > "archival" format for large binary files which contains enough error > correction to that all original data may be recovered even if there is a > little data loss in the storage media? > > You could check out http://dvdisaster.net/en/index.html From rtomek at ceti.com.pl Tue Jun 8 12:19:44 2010 From: rtomek at ceti.com.pl (Tomasz Rola) Date: Tue, 8 Jun 2010 21:19:44 +0200 (CEST) Subject: [Beowulf] OT: recoverable optical media archive format? In-Reply-To: References: Message-ID: On Tue, 8 Jun 2010, David Mathog wrote: > This is off topic so I will try to keep it short: is there an > "archival" format for large binary files which contains enough error > correction to that all original data may be recovered even if there is a > little data loss in the storage media? > > For my purposes these are disk images, sometimes .tar.gz, other times > gunzip -c of dd dumps of whole partitions which have been "cleared" by > filling the empty space with one big file full of zero, and then that > file deleted. I'm thinking of putting this information on DVD's (only > need to keep it for a few years at a time) but I don't trust that media > not to lose a sector here or there - having watched far too many > scratched DVD movies with playback problems. > > Unlike an SDLT with a bad section, the good parts of a DVD are still > readable when there is a bad block (using dd or ddrescue) but of course > even a single missing chunk makes it impossible to decompress a .gz file > correctly. So what I'm looking for is some sort of .img.gz.ecc format, > where the .ecc puts in enough redundant information to recover the > underlying img.gz even when sectors or data are missing. If no such > tool/format exists then two copies should be enough to recover all of an > .img.gz so long as the same data wasn't lost on both media, and if bad > DVD sectors always come back as "failed read", never ever showing up as > a good read but actually containing bad data. Perhaps the frame > checksum on a DVD is enough to guarantee that? I use tar, gzip/bzip2, split - for creating a number of files of more or less similar lenghts (like, 50 megs or 100 megs, but usually 50). After that, I make par2 recovery files with par2cmdline tools (they make use of Solomon-Reed error correction) http://en.wikipedia.org/wiki/Parchive http://parchive.sourceforge.net/ I am unable to find par2cmdline via google ATM, but they should be somewhere. And last but not least, I burn it all (data + pars). HTH. Regards, Tomasz Rola -- ** A C programmer asked whether computer had Buddha's nature. ** ** As the answer, master did "rm -rif" on the programmer's home ** ** directory. And then the C programmer became enlightened... ** ** ** ** Tomasz Rola mailto:tomasz_rola at bigfoot.com ** From dkimpe at mcs.anl.gov Wed Jun 9 04:05:13 2010 From: dkimpe at mcs.anl.gov (Dries Kimpe) Date: Wed, 9 Jun 2010 11:05:13 +0000 Subject: [Beowulf] OT: recoverable optical media archive format? 
In-Reply-To: References: Message-ID: <20100609110513.GA4359@X300.rhi.hi.is> * David Mathog [2010-06-08 10:44:55]: > This is off topic so I will try to keep it short: is there an > "archival" format for large binary files which contains enough error > correction to that all original data may be recovered even if there is a > little data loss in the storage media? > For my purposes these are disk images, sometimes .tar.gz, other times > gunzip -c of dd dumps of whole partitions which have been "cleared" by > filling the empty space with one big file full of zero, and then that > file deleted. I'm thinking of putting this information on DVD's (only > need to keep it for a few years at a time) but I don't trust that media > not to lose a sector here or there - having watched far too many > scratched DVD movies with playback problems. > Unlike an SDLT with a bad section, the good parts of a DVD are still > readable when there is a bad block (using dd or ddrescue) but of course > even a single missing chunk makes it impossible to decompress a .gz file > correctly. So what I'm looking for is some sort of .img.gz.ecc format, > where the .ecc puts in enough redundant information to recover the > underlying img.gz even when sectors or data are missing. If no such > tool/format exists then two copies should be enough to recover all of an > .img.gz so long as the same data wasn't lost on both media, and if bad > DVD sectors always come back as "failed read", never ever showing up as > a good read but actually containing bad data. Perhaps the frame > checksum on a DVD is enough to guarantee that? You should also consider protecting the metadata of the filesystem; I.e. what good does it do to have split files, correction data, ... if it cannot find the file any longer because the damaged sector was in the directory metadata, not in the actual file data? RAR has 'recovery record' support that is tunable (you can pick how much space you want to sacrifice to recovery). You could pack everything in a rar file (with recovery records turned on) and write the whole file directly to the dvd (i.e. using dd or growisofs -Z /dev/dvd=rarfile). The downside is that the filesize will not be preserved, you'd have to check if unrar can deal with this or if it requires the file size to be known. A quick test with a small rar archive seems to indicate that it does not care if extra data is added at the end of the file. Dries -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available URL: From stuart at cyberdelix.net Wed Jun 9 07:17:01 2010 From: stuart at cyberdelix.net (lsi) Date: Wed, 09 Jun 2010 15:17:01 +0100 Subject: [Beowulf] OT: recoverable optical media archive format? In-Reply-To: References: Message-ID: <4C0FA25D.949.37AA4C7A@stuart.cyberdelix.net> The ECC approach is nice. My current solution is to burn two copies of each archive DVD. If the media deteriorates and there are unreadable sectors, I use the second copy of the DVD to replace the dead sectors. This is done using software which has a Bad Sector Mapping function, and a Patch File function. I have used Media Doctor to do this job. I wrote it up here: http://www.cyberdelix.net/tech/recover_cd_dvd.htm This said, I want to dump DVD as an archive format (I find that only certain drives will read certain DVDs, total PITA), I'm considering using HDDs, possibly 2.5" in a RAID configuration, in a NAS which is only fired up when needed. 
I suspect nowadays, 2.5" drives are more space-efficient, although I haven't done the sums. Stu On 8 Jun 2010 at 10:44, David Mathog wrote: > This is off topic so I will try to keep it short: is there an > "archival" format for large binary files which contains enough error > correction to that all original data may be recovered even if there is a > little data loss in the storage media? --- Stuart Udall stuart at at cyberdelix.dot net - http://www.cyberdelix.net/ --- * Origin: lsi: revolution through evolution (192:168/0.2) From ttsiodras at gmail.com Fri Jun 11 10:16:22 2010 From: ttsiodras at gmail.com (Thanassis Tsiodras) Date: Fri, 11 Jun 2010 20:16:22 +0300 Subject: [Beowulf] Re: rsbep issues In-Reply-To: References: Message-ID: I don't know if adding the beowulf mailing list address to CC will work - if it doesn't, please forward this response on my behalf. Your question abou the output size: Due to the mechanics of reed-solomon, if we want to be able to recover from sector errors, our input split must be (a) split in large sections and (b) interleaved. You can read the relevant theory from my "Algorithm" section on my rsbep page (http://users.softlab.ece.ntua.gr/~ttsiod/rsbep.html) where I explain why we need to interleave the stream... So, rsbep, on its own, creates outputs of multiples of 1040400. This won't do, of course - we don't want to see garbage after our "removal of shield" (piped to our gzip / bzip2 / lzma / whatever) ... So what does my package do? It installs two helper scripts (freeze.sh / melt.sh) which add a simple header on the stream, BEFORE shielding it with Reed-Solomon, so that the decoding side uses this size to "chop" off the extra cruft at the end... So if you use the tools as described in the packaged README, and as described on the site (that is, via freeze.sh/melt.sh) then you won't see the "garbage at the end" problem. The other thing you report, about the "silent corruption", is serious. If you check my configure.ac, you'll see that it checks for GCC version: # Check for bad GCC version (4.4.x create bad code for rs32.c) AX_GCC_VERSION if test `expr substr $GCC_VERSION 1 3` == "4.4" ; then AC_MSG_ERROR([GCC Series 4.4.x generate bad code... Please use 4.3.x instead]) fi Unfortunately, it's not just 4.4 - something has changed after GCC 4.3 that breaks the rsbep code... You can either use the 4.3.x series, or compile in plain-C mode (hardcode X86ASM to "no"). I use Debian, where the stable GCC version is 4.3.4, and therefore I don't see this bug... I updated autoconf check to stop the build if it detects GCC >= 4.4. That's all I can do for now... if you have the time/resources to figure out why the code broke after this GCC version, I'd happily publish your patches... (I use Debian, where the stable GCC is 4.3.4, and this problem doesn't manifest): bash$ dd if=/dev/urandom of=data bs=1M count=100 100+0 records in 100+0 records out 104857600 bytes (105 MB) copied, 20.0936 s, 5.2 MB/s bash$ freeze.sh data > data.shielded bash$ dd if=/dev/zero of=data.shielded conv=notrunc bs=512 count=127 bash$ melt.sh data.shielded | md5sum rsbep: number of corrected failures : 64784 rsbep: number of uncorrectable blocks : 0 b55c7886465da082c4949698858d342c - bash$ cat data | md5sum b55c7886465da082c4949698858d342c - So the data were recovered fine, even after a loss of 127 consecutive sectors. I hope this helps... Kind regards, Thanassis Tsiodras, Dr.-Ing. 
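(For illustration only: a rough sketch of that size-header idea in plain
shell. This is not the actual freeze.sh/melt.sh code, and the fixed-width
header format shown here is made up.)

  # "freeze": record the original length in a small fixed-width header,
  # then shield the whole stream with rsbep
  size=$(stat -c%s data)
  ( printf '%020d' "$size"; cat data ) | rsbep > data.shielded

  # "melt": undo the shielding, read the 20-byte header back, and chop
  # off the Reed-Solomon padding at the end
  rsbep -d < data.shielded > data.padded
  size=$(head -c 20 data.padded)
  tail -c +21 data.padded | head -c "$size" > data.recovered
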
On Thu, Jun 10, 2010 at 10:29 PM, David Mathog wrote: > Hi, > > I posted in the beowulf mailing list (beowulf at beowulf.org) asking for > suggestions for a program that would ?allow recovery from corrupted > media, and your rsbep variant was suggested. ?So I gave it a try, and it > had some issues with the test data I ran. ?Possibly because the files > were so large? > > It was built on a Mandriva 2008.1 system with: > > ./configure --prefix=/usr/common > make > make install > > I posted my observations to the beowulf mailing list, but am including a > copy of them below my signature. ?Perhaps you might want to respond to > that list with corrections to whatever I did wrong, or a notice of a > patched version of the program, if these really are bugs. > > I will send you a copy of the pockmark program in a separate email. > > Thanks, > > David Mathog > mathog at caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > > ------------- Forwarded message follows ------------- > > Jesse Becker and others suggested: > >> ? ? http://users.softlab.ntua.gr/~ttsiod/rsbep.html > > I tried it and it works, mostly, but definitely has some warts. > > To start with I gave it a negative control - a file so badly corrupted > it should NOT have been able to recover it. > > % ssh remotePC 'dd if=/dev/sda1 bs=8192' >img.orig > % cat img.orig ? ? ?| bzip2 >img.bz2.orig > % cat img.bz2.orig ?| rsbep > img.bz2.rsbep > % cat img.bz2.rsbep | pockmark -maxgap 100000 -maxrun 10000 >>img.bz2.rsbep.pox > % cat img.bz2.rsbep.pox | rsbep -d -v >img.bz2.restored > rsbep: number of corrected failures ? : 9725096 > rsbep: number of uncorrectable blocks : 0 > > img.orig is a Windows XP partition with all empty space filled with > 0x0 bytes. ?That is then compressed with bzip2, then run > through rsbep (the one from the link above), then corrupted > with pockmark. ?Pockmark is my own little concoction, when used as > shown ?it stamps 0x0 bytes starting randomly every (1-MAXGAP) bytes, for > a run of (1-MAXRUN). ?In both cases the gap and run length are chosen at > random from those ranges for each new gap/run. > This should corrupt around 10% of the file, which I assumed would render > it unrecoverable. ?Notice in the file sizes below that the overall size > did not change when the file was run through pockmark. ?rsbep did not > note any errors it couldn't correct. However, the > size of the restored file is not the same as the orig. > > ?4056976560 2010-06-08 17:51 img.bz2.restored > ?4639143600 2010-06-08 16:19 img.bz2.rsbep.pox > ?4639143600 2010-06-08 16:13 img.bz2.rsbep > ?4056879025 2010-06-08 14:40 img.bz2.orig > 20974431744 2010-06-07 15:23 img.orig > > % bunzip2 -tvv img.bz2.restored > ?img.bz2.restored: > ? ?[1: huff+mtf data integrity (CRC) error in data > > So at the very least rsbep sometimes says it has recovered a file when > it has not. ?I didn't really expect it to rescue this particular input, > but it really should have handled it better. ? I reran it with a less > damaged file like this: > > > % cat img.bz2.rsbep | pockmark -maxgap 1000000 -maxrun 10000 >>img.bz2.rsbep.pox2 > % cat img.bz2.rsbep.pox2 | rsbep -d -v >img.bz2.restored2 > rsbep: number of corrected failures ? 
: 46025036 > rsbep: number of uncorrectable blocks : 0 > % bunzip2 img.bz2.restored2 > bunzip2: Can't guess original name for img.bz2.restored2 -- using > img.bz2.restored2.out > bunzip2: img.bz2.restored2: trailing garbage after EOF ignored > % md5sum img.bz2.restored2.out img.orig > 7fbaec7143c3a17a31295a803641aa3c ?img.bz2.restored2.out > 7fbaec7143c3a17a31295a803641aa3c ?img.orig > > This time it was able to recover the corrupted file, but again, it > created an output file which was a different size. ?Is this always the > case? ? Seems to be at least for the size file used here: > > % cat img.bz2.orig | rsbep | rsbep -d > nopox.bz2 > > nopox.bz2 is also 4056976560. ? The decoded output is always 97535 bytes > larger than the original, which may bear some relation to the > z=ERR_BURST_LEN parameter as: > > ?97535 /765 = 127.496732 > > which is suspiciously close to 255/2. ?Or that could just be a coincidence. > > In any case, bunzip2 was able to handle the crud on the end, but this > would have been a problem for other binary files. > > Tbe other thing that is frankly bizarre is the number of "corrected" > failures for the 2nd case vs. the first. ? ?The 2nd should have 10X > fewer bad bytes than the first, but the rsbep status messages > indicate 4.73X MORE. ?However, the number of bad bytes in the 2nd is > almost exactly 1%, as it should be. ?All of this suggests that rsbep > does not handle correctly files which are "too" corrupted. ?It gives the > wrong number of corrected blocks and thinks that it has corrected > everything when it has not done so. ?Worse, even when it does work the > output file was never (in any of the test cases) the same size as the > input file. > > I think this program has potential but it needs a bit of work to sand > the rough edges off. ?I will have a look at it, but won't have a chance > to do so for a couple of weeks. > > Regards, > > David Mathog > mathog at caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > > > > > -- What I gave, I have; what I spent, I had; what I kept, I lost. -Old Epitaph From eugen at leitl.org Mon Jun 14 06:22:09 2010 From: eugen at leitl.org (Eugen Leitl) Date: Mon, 14 Jun 2010 15:22:09 +0200 Subject: [Beowulf] 10 U = 512x Atom Z530 + 2 GByte RAM Message-ID: <20100614132209.GG1964@leitl.org> http://www.heise.de/newsticker/meldung/Cloud-Server-mit-Intel-Atom-und-spaeter-auch-ARM-Prozessoren-1021400.html -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE From jlforrest at berkeley.edu Mon Jun 14 08:49:33 2010 From: jlforrest at berkeley.edu (Jon Forrest) Date: Mon, 14 Jun 2010 08:49:33 -0700 Subject: [Beowulf] 48-Core X86_64 Compute Node - Good Idea? Message-ID: <4C164F8D.60301@berkeley.edu> I have a cluster up and running that uses those SuperMicro Twin boxes, which have 2 nodes per rack unit, with each node using 2 AMD 6-core Istanbuls. This results in 12 cores per node, or 24 cores per rack unit. This is working fine. Now that the 12-core AMD processors are out, I was hoping that I could get the same configuration, except using 12-core processors, and yielding 48 cores per rack unit. The problem is, as of right now, I believe such boxes aren't available yet. The closest thing is a 4-way 1U box, which gives 48 cores per rack unit, but in *1 node*. 
My intuition tells me that I should be wary of such a configuration because of various SMP-related locking and concurrency issues. There probably aren't many single node 48 core boxes out there so there might be surprises. I don't like surprises. The obvious thing to do would be to wait until the Twin boxes come out but my problem is that I have money to spend that has to be spent soon, maybe before the Twin boxes come out. So, I'm trying to decide what to do. (I only want 1U boxes because I have to pay for rack space). Any advice? Cordially, -- Jon Forrest Research Computing Support College of Chemistry 173 Tan Hall University of California Berkeley Berkeley, CA 94720-1460 510-643-1032 jlforrest at berkeley.edu From ntmoore at gmail.com Mon Jun 14 09:22:34 2010 From: ntmoore at gmail.com (Nathan Moore) Date: Mon, 14 Jun 2010 11:22:34 -0500 Subject: [Beowulf] 10 U = 512x Atom Z530 + 2 GByte RAM In-Reply-To: References: <20100614132209.GG1964@leitl.org> Message-ID: I should have said "fancy ALU/FPU" no co-processor. Oh well. On Mon, Jun 14, 2010 at 11:21 AM, Nathan Moore wrote: > Sounds like a BlueGene without the fancy math co-processor and with a > more normal os... > > On Mon, Jun 14, 2010 at 8:22 AM, Eugen Leitl wrote: >> >> http://www.heise.de/newsticker/meldung/Cloud-Server-mit-Intel-Atom-und-spaeter-auch-ARM-Prozessoren-1021400.html >> >> >> >> -- >> Eugen* Leitl leitl http://leitl.org >> ______________________________________________________________ >> ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org >> 8B29F6BE: 099D 78BA 2FD3 B014 B08A ?7779 75B0 2443 8B29 F6BE >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf >> > > > > -- > - - - - - - - ? - - - - - - - ? - - - - - - - > Nathan Moore > Assistant Professor, Physics > Winona State University > AIM: nmoorewsu > - - - - - - - ? - - - - - - - ? - - - - - - - > -- - - - - - - - - - - - - - - - - - - - - - Nathan Moore Assistant Professor, Physics Winona State University AIM: nmoorewsu - - - - - - - - - - - - - - - - - - - - - From ntmoore at gmail.com Mon Jun 14 09:21:44 2010 From: ntmoore at gmail.com (Nathan Moore) Date: Mon, 14 Jun 2010 11:21:44 -0500 Subject: [Beowulf] 10 U = 512x Atom Z530 + 2 GByte RAM In-Reply-To: <20100614132209.GG1964@leitl.org> References: <20100614132209.GG1964@leitl.org> Message-ID: Sounds like a BlueGene without the fancy math co-processor and with a more normal os... 
On Mon, Jun 14, 2010 at 8:22 AM, Eugen Leitl wrote: > > http://www.heise.de/newsticker/meldung/Cloud-Server-mit-Intel-Atom-und-spaeter-auch-ARM-Prozessoren-1021400.html > > > > -- > Eugen* Leitl leitl http://leitl.org > ______________________________________________________________ > ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org > 8B29F6BE: 099D 78BA 2FD3 B014 B08A ?7779 75B0 2443 8B29 F6BE > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- - - - - - - - - - - - - - - - - - - - - - Nathan Moore Assistant Professor, Physics Winona State University AIM: nmoorewsu - - - - - - - - - - - - - - - - - - - - - From jlforrest at berkeley.edu Mon Jun 14 10:51:04 2010 From: jlforrest at berkeley.edu (Jon Forrest) Date: Mon, 14 Jun 2010 10:51:04 -0700 Subject: [Beowulf] 48-Core X86_64 Compute Node - Good Idea? In-Reply-To: References: <4C164F8D.60301@berkeley.edu> Message-ID: <4C166C08.8070600@berkeley.edu> On 6/14/2010 10:29 AM, Mark Hahn wrote: >> right now, I believe such boxes aren't available >> yet. The closest thing is a 4-way 1U box, which >> gives 48 cores per rack unit, but in *1 node*. > > well, the supermicro website lists them: > http://www.supermicro.com/Aplus/system/2U/2022/AS-2022TG-HTRF.cfm > http://www.supermicro.com/Aplus/system/2U/2022/AS-2022TG-HIBQRF.cfm Those are both 2U boxes. I was hoping to find a 1U box since they cost less to co-locate. >> My intuition tells me that I should be wary of >> such a configuration because of various SMP-related >> locking and concurrency issues. > > why? is there something peculiar about your workload, and especially > something that would show up with modestly higher SMPness? Nothing specific. I'm worried about latency and locking issues that only pop out when larger numbers of cores are used. > this is hardly uncharted territory. SGI's been there forever, > and some fringe boxes from Intel. but 8s 4c has been pretty mundane > for a while, and doesn't need any sort of hand-holding. unless you > mean something like "I expect to swap a lot and want to configure a > single non-raid swap partition", I don't really see what you're worrying > about... SGI isn't mainstream, and probably doesn't use the same chipset and motherboards that SuperMicro will be selling. > I think people should actually take fresh look at 4s 1U boxes > because AMD has eliminated the "4-socket penalty". there are some > nontrivial advantages to fatter nodes - they let you achieve some > unique workload configurations (bigger memory, higher-threaded, etc). > sysadmin work doesn't scale linearly as the number of nodes, of course, > but having fewer, fatter nodes can be attractive TCO-wise, too. If SuperMicro doesn't come up with Twin boxes, I might be forced to follow your advice. I'm not concerned about sysadmin work, because I'm using Rocks. I'm more concerned about ending up in the Twilight Zone where things aren't as they appear. -- Jon Forrest Research Computing Support College of Chemistry 173 Tan Hall University of California Berkeley Berkeley, CA 94720-1460 510-643-1032 jlforrest at berkeley.edu From hahn at mcmaster.ca Mon Jun 14 10:29:15 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon, 14 Jun 2010 13:29:15 -0400 (EDT) Subject: [Beowulf] 48-Core X86_64 Compute Node - Good Idea? 
In-Reply-To: <4C164F8D.60301@berkeley.edu> References: <4C164F8D.60301@berkeley.edu> Message-ID: > right now, I believe such boxes aren't available > yet. The closest thing is a 4-way 1U box, which > gives 48 cores per rack unit, but in *1 node*. well, the supermicro website lists them: http://www.supermicro.com/Aplus/system/2U/2022/AS-2022TG-HTRF.cfm http://www.supermicro.com/Aplus/system/2U/2022/AS-2022TG-HIBQRF.cfm > My intuition tells me that I should be wary of > such a configuration because of various SMP-related > locking and concurrency issues. why? is there something peculiar about your workload, and especially something that would show up with modestly higher SMPness? > There probably aren't > many single node 48 core boxes out there so there > might be surprises. I don't like surprises. this is hardly uncharted territory. SGI's been there forever, and some fringe boxes from Intel. but 8s 4c has been pretty mundane for a while, and doesn't need any sort of hand-holding. unless you mean something like "I expect to swap a lot and want to configure a single non-raid swap partition", I don't really see what you're worrying about... > The obvious thing to do would be to wait until > the Twin boxes come out but my problem is that > I have money to spend that has to be spent soon, > maybe before the Twin boxes come out. So, I'm trying > to decide what to do. (I only want 1U boxes because > I have to pay for rack space). I think people should actually take fresh look at 4s 1U boxes because AMD has eliminated the "4-socket penalty". there are some nontrivial advantages to fatter nodes - they let you achieve some unique workload configurations (bigger memory, higher-threaded, etc). sysadmin work doesn't scale linearly as the number of nodes, of course, but having fewer, fatter nodes can be attractive TCO-wise, too. regards, mark hahn. From hearnsj at googlemail.com Tue Jun 15 02:30:40 2010 From: hearnsj at googlemail.com (John Hearns) Date: Tue, 15 Jun 2010 10:30:40 +0100 Subject: [Beowulf] 48-Core X86_64 Compute Node - Good Idea? In-Reply-To: <4C166C08.8070600@berkeley.edu> References: <4C164F8D.60301@berkeley.edu> <4C166C08.8070600@berkeley.edu> Message-ID: On 14 June 2010 18:51, Jon Forrest wrote: >> > SGI isn't mainstream, and probably doesn't use the > same chipset and motherboards that SuperMicro > will be selling. > Take a close, close look inside those ICE blade enclosures. You are of course correct though - IA64 type Altixes and Ultraviolten are definitely not mainstream! From eugen at leitl.org Tue Jun 15 02:41:55 2010 From: eugen at leitl.org (Eugen Leitl) Date: Tue, 15 Jun 2010 11:41:55 +0200 Subject: [Beowulf] 10 U = 512x Atom Z530 + 2 GByte RAM In-Reply-To: <07FA933F-62AB-491C-80FF-C8C1D8906537@divination.biz> References: <20100614132209.GG1964@leitl.org> <07FA933F-62AB-491C-80FF-C8C1D8906537@divination.biz> Message-ID: <20100615094155.GU1964@leitl.org> On Mon, Jun 14, 2010 at 10:14:16PM -0400, Douglas J. Trainor wrote: > in Englisch: > > http://www.eweek.com/c/a/IT-Infrastructure/SeaMicro-Uses-Intel-Atom-Chip-in-Server-Architecture-745338/ > > Quote: "Feldman said the motherboard is shrunk from the size of a pizza box to that of a credit card." It's an interesting device. It would take >6 racks of those 300 EUR Supermicro Atom servers at a >200 kEUR price tag to match their 10 U. Though of course one could just use the 200 EUR motherboards on custom trays, and share the PSUs. Though one would then just stick with plain GBit Ethernet, and not their custom fabric. 
> douglas > > On Jun 14, 2010, at 9:22 AM, Eugen Leitl wrote: > > > > > http://www.heise.de/newsticker/meldung/Cloud-Server-mit-Intel-Atom-und-spaeter-auch-ARM-Prozessoren-1021400.html > > > > -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE From ttsiodras at gmail.com Sat Jun 12 18:39:33 2010 From: ttsiodras at gmail.com (Thanassis Tsiodras) Date: Sun, 13 Jun 2010 04:39:33 +0300 Subject: [Beowulf] OT: recoverable optical media archive format? Message-ID: I am the author of the updated "rsbep" package that was mentioned in this thread, and I was contacted by David Mathog (just "mathog" henceforth) about the issues he had. "mathog" reported a difference in the decoded output sizes, when directly using "rsbep" - but as mentioned by David N. Lombard, the usage scenario that "mathog" followed was not a sanctioned one: both my site's article as well as the package instructions (README) referred to the "freeze"/"melt" scripts, that decode the shielded data into the correct output size. The reason for what mathog experienced is a bit complex, but clearly explained on my site (http://users.softlab.ece.ntua.gr/~ttsiod/rsbep.html), and it boils down to this: in order to withstand storage errors, an interleaving of the Reed-Solomon encoded data has to take place. Basically, the x86 ASM code of Reed-Solomon that I "inherited" from the original "rsbep" and use in my package, adds 16 bytes of parity data to each block of 223 bytes of input, turning it to a 255-bytes block. These parity bytes allow detection and correction of 16 errors (in the encoded 255-byte block), as well as detection of 32 errors (in the encoded 255-byte block). This however won't work for storage media, since they work or fail on sector boundaries (512 bytes for disks and 2048 bytes for CDs/DVDs) - so the encoded data are interleaved by my package, inside blocks of 1040400 bytes (containing 4080 of the Reed-Solomon-encoded 255-byte blocks)... In this way, a loss of a sector only impacts ONE byte in the 512 encoded "blocks" that are passing through it (due to the interleaving)... If interested, you can read more details on my page, where I explain how the idea works. The end result, is that - the interleaved stream can lose 127 contiguous sectors (65024 contiguous bytes) and still be recoverable. - the interleaved stream can lose 128-255 sectors, and detect the error (and report it, but not fix it) - Beyond that number of errors (which correspond, after de-interleaving, to more than 32 bytes in the encoded 255 byte block), the Reed-Solomon code is lost... Given the interleaving that my package performs on the encoded bytes, the only chance of this happening, is losing a contiguous stream of more than 32x4080 bytes, i.e. 130560 bytes. A storage error that causes this much loss (255 contiguous sectors!) is a lost cause anyway - at least as far as my needs go. If you want to be able to recover from this or even larger amounts of loss, you can do it, by increasing the block size from my chosen 255x4080 (1040400 bytes) to something even bigger, and by adapting my interleaving code (rsbep.c, "distribute" function). To summarize, "mathog"'s pockmark app is not representative of what happens in storage media - they NEVER fail on byte-levels - they fail on sector levels. So what should you do, if you want to be 100% sure of failure detection? 
Simple: By reviewing my freeze/melt scripts, you will see that all I do
to the "to-be-shielded-stream" is (a) add a "magic marker" and (b) add
the file size, so that "melt.sh" can chop the output down to the right
size. If you want bullet-proof checks, you can easily add the MD5 or SHA
sum of the input data, to the "to-be-shielded-stream", so that the
"melting" process can check this and be 100% certain in restoration or
detection of failure, even in the face of impossible stream corruption
(more than 130K lost). Note however, that this is not necessary if you
use an algorithm that can detect errors in the decoded stream (which is
how I use my rsbep - i.e always on a stream generated by gzip, bzip2,
etc)

Hope this clarifies things.

Kind regards,
Thanassis Tsiodras, Dr.-Ing.

--
What I gave, I have; what I spent, I had; what I kept, I lost. -Old Epitaph
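A minimal shell sketch of that checksum idea follows. It assumes only the
pipe usage shown earlier in this thread (rsbep encodes stdin to stdout,
rsbep -d decodes); the shield/unshield names and the MAGIC header layout
are invented here for illustration and are not part of the rsbep package
or of its freeze/melt scripts.

#!/bin/bash
# Illustrative sketch only: prepend an MD5 sum and the original size to the
# stream before Reed-Solomon shielding, and verify both after decoding.
set -e

shield() {                      # shield <input> <shielded-output>
    local in="$1" out="$2" sum size
    sum=$(md5sum "$in" | awk '{print $1}')
    size=$(stat -c %s "$in")
    { printf 'MAGIC %s %s\n' "$sum" "$size"; cat "$in"; } | rsbep > "$out"
}

unshield() {                    # unshield <shielded-input> <output>
    local in="$1" out="$2" header sum size
    rsbep -d < "$in" > "$out.tmp"
    header=$(head -c 64 "$out.tmp" | head -n 1)   # "MAGIC <md5> <size>"
    sum=$(echo "$header" | awk '{print $2}')
    size=$(echo "$header" | awk '{print $3}')
    # skip the header line, then truncate the decoder's padding away
    tail -c +$(( ${#header} + 2 )) "$out.tmp" | head -c "$size" > "$out"
    rm -f "$out.tmp"
    echo "$sum  $out" | md5sum -c -   # fails loudly if restoration is incomplete
}

With a wrapper along these lines, a decode that silently truncates or pads
the payload shows up as an md5sum mismatch rather than as trailing garbage
after EOF.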
From ttsiodras at gmail.com  Sun Jun 13 13:16:35 2010
From: ttsiodras at gmail.com (Thanassis Tsiodras)
Date: Sun, 13 Jun 2010 23:16:35 +0300
Subject: [Beowulf] OT: recoverable optical media archive format?
Message-ID:

The corruption of rsbep-protected data with GCC versions later than 4.3
has now been fixed. I reviewed the code with valgrind, and it turned out
that the code I "inherited" from the original rsbep package performed
out of bounds read accesses in the "distribute" function... This was not
in the "core" Reed-solomon code, only in the interleaving function. I
re-wrote it - check rsbep.c, line 68 in the latest tarball. I also did a
clean-up of useless code that was never called.

The results work perfectly under all GCC versions I tried, under
Debian/32, Arch/32 and FreeBSD/64. Give it a try:

http://users.softlab.ntua.gr/~ttsiod/rsbep-0.1.0-ttsiodras.tar.bz2

So, to summarize, my response points on the original thread:

- The errors reported by David Mathog on the list had to do with
erroneous usage of the tools - either the "freeze"/"melt" scripts must
be used, or "chopping" of the output has to be done via your own custom
code.

- The memory read access errors were fixed, so the tool works fine with
all GCC versions under all OSes.

- David's "pockmark" app is not representative of what happens in
storage media - they don't fail on byte-boundaries - they fail on sector
boundaries. My small contributions (freeze/melt scripts) on rsbep make
sure that even if we lose 127 contiguous 512-byte sectors, we can still
recover the data, at the exact original size.

- If you want bullet-proof checks, you can easily add the MD5 or SHA sum
of the input data, to the "to-be-shielded-stream", so that the "melting"
can be 100% certain of successful restoration or of detecting failure.
This, however, is not necessary if you use an algorithm that can detect errors in the decoded stream (gzip, bzip2, etc) - One final comment about PAR, which was suggested by others: since it was designed for newsgroups, its recovery capabilities had other (non-storage related) scope - for an "executive summary" read the last of the comments I received when my rsbep became Slashdotted, here: http://hardware.slashdot.org/hardware/08/08/03/197254.shtml -- What I gave, I have; what I spent, I had; what I kept, I lost. -Old Epitaph From andreas.de-blanche at hv.se Mon Jun 14 00:32:18 2010 From: andreas.de-blanche at hv.se (Andreas de Blanche) Date: Mon, 14 Jun 2010 09:32:18 +0200 Subject: [Beowulf] Survey how industrial companies use their HPC Resources In-Reply-To: References: Message-ID: <4C15F722.7F5B.0055.1@hv.se> Dear all, I am sending out this survey on behalf of a master student of mine, he would very much appreciate if you took his survey. http://FreeOnlineSurveys.com/rendersurvey.asp?sid=xrimbjplz1tr5kd772404 Best Regards //Andreas de Blanche ******************************************************************* I'm Wei He, a master student who study Computer Science at University West in Sweden. I'm doing my thesis on investigation of how industrial companies use their High performance computing resources. I'm performing this study with my two supervisors Dr. Linn Christiernin and Andreas de Blanche. I know you have a wide range of experience in the field. It will take approximately 6-7 minutes to solve, please finish it before June 30th. Your help in completing this questionnaire is very much appreciated. The data collected is solely for research purpose. Thank you for participating in this questionnaire survey. If you have two Computing Resource please take this survey twice. http://FreeOnlineSurveys.com/rendersurvey.asp?sid=xrimbjplz1tr5kd772404 Yours Sincearly Wei He Master Student at University West, Sweden ******************************************************************* From deadline at eadline.org Tue Jun 15 08:31:57 2010 From: deadline at eadline.org (Douglas Eadline) Date: Tue, 15 Jun 2010 11:31:57 -0400 (EDT) Subject: [Beowulf] Survey how industrial companies use their HPC Resources In-Reply-To: <4C15F722.7F5B.0055.1@hv.se> References: <4C15F722.7F5B.0055.1@hv.se> Message-ID: <45740.192.168.1.213.1276615917.squirrel@mail.eadline.org> Will the results be publicly available? -- Doug > Dear all, > I am sending out this survey on behalf of a master student of mine, he > would very much appreciate if you took his survey. > http://FreeOnlineSurveys.com/rendersurvey.asp?sid=xrimbjplz1tr5kd772404 > > > Best Regards > //Andreas de Blanche > > ******************************************************************* > I'm Wei He, a master student who study Computer Science at University > West in Sweden. I'm doing my thesis on > investigation of how industrial companies use their High performance > computing resources. I'm performing this > study with my two supervisors Dr. Linn Christiernin and Andreas de > Blanche. I know you have a wide range of > experience in the field. > It will take approximately 6-7 minutes to solve, please finish it > before June 30th. Your help in completing this > questionnaire is very much appreciated. The data collected is solely > for research purpose. > Thank you for participating in this questionnaire survey. > If you have two Computing Resource please take this survey twice. 
> > http://FreeOnlineSurveys.com/rendersurvey.asp?sid=xrimbjplz1tr5kd772404 > > > > Yours Sincearly > Wei He > Master Student at University West, Sweden > ******************************************************************* > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > -- Doug -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From lindahl at pbm.com Tue Jun 15 13:12:26 2010 From: lindahl at pbm.com (Greg Lindahl) Date: Tue, 15 Jun 2010 13:12:26 -0700 Subject: [Beowulf] Survey how industrial companies use their HPC Resources In-Reply-To: <45740.192.168.1.213.1276615917.squirrel@mail.eadline.org> References: <4C15F722.7F5B.0055.1@hv.se> <45740.192.168.1.213.1276615917.squirrel@mail.eadline.org> Message-ID: <20100615201226.GF21791@bx9.net> On Tue, Jun 15, 2010 at 11:31:57AM -0400, Douglas Eadline wrote: > Will the results be publicly available? At the end of the survey, you can put in an email address to receive a copy of his finished thesis. I noticed that the country list doesn't include the USA. -- greg From buccaneer at rocketmail.com Wed Jun 16 17:08:16 2010 From: buccaneer at rocketmail.com (Buccaneer for Hire.) Date: Wed, 16 Jun 2010 17:08:16 -0700 (PDT) Subject: [Beowulf] Survey how industrial companies use their HPC Resources In-Reply-To: <20100615201226.GF21791@bx9.net> Message-ID: <814583.50541.qm@web30602.mail.mud.yahoo.com> --- On Tue, 6/15/10, Greg Lindahl wrote: > From: Greg Lindahl > > I noticed that the country list doesn't include the USA. > I just made the assumption if you left it blank that it was understood. :) From rigved.sharma123 at gmail.com Wed Jun 16 18:47:18 2010 From: rigved.sharma123 at gmail.com (rigved sharma) Date: Thu, 17 Jun 2010 07:17:18 +0530 Subject: [Beowulf] tracejob command shows error Message-ID: hi we are having cluster of 16 nodes and torque and maui installed on it. we have just migrated from torque 2.3.6 to torque 2.4.8.but tracejob command is not working. 
/usr/spool/PBS/server_priv/accounting/20100617: No matching job records located /usr/spool/PBS/server_logs/20100617: No matching job records located /usr/spool/PBS/mom_logs/20100617: No such file or directory /usr/spool/PBS/sched_logs/20100617: No such file or directory *** glibc detected *** tracejob: malloc(): memory corruption: 0x0000000003a74140 *** ======= Backtrace: ========= /lib64/libc.so.6[0x3845871cd1] /lib64/libc.so.6(__libc_malloc+0x7d)[0x3845872e8d] /lib64/libc.so.6(popen+0x23)[0x3845862a63] tracejob[0x401218] tracejob[0x401bcf] /lib64/libc.so.6(__libc_start_main+0xf4)[0x384581d8b4] tracejob[0x400e09] ======= Memory map: ======== 00400000-00403000 r-xp 00000000 68:05 6094878 /usr/spool/PBS/bin/tracejob 00603000-00604000 rw-p 00003000 68:05 6094878 /usr/spool/PBS/bin/tracejob 03a74000-03a95000 rw-p 03a74000 00:00 0 3844800000-384481a000 r-xp 00000000 68:02 896307 /lib64/ld-2.5.so 3844a1a000-3844a1b000 r--p 0001a000 68:02 896307 /lib64/ld-2.5.so 3844a1b000-3844a1c000 rw-p 0001b000 68:02 896307 /lib64/ld-2.5.so 3845800000-384594a000 r-xp 00000000 68:02 896308 /lib64/libc-2.5.so 384594a000-3845b49000 ---p 0014a000 68:02 896308 /lib64/libc-2.5.so 3845b49000-3845b4d000 r--p 00149000 68:02 896308 /lib64/libc-2.5.so 3845b4d000-3845b4e000 rw-p 0014d000 68:02 896308 /lib64/libc-2.5.so 3845b4e000-3845b53000 rw-p 3845b4e000 00:00 0 3849c00000-3849c0d000 r-xp 00000000 68:02 896314 /lib64/libgcc_s-4.1.2-20080102.so.1 3849c0d000-3849e0d000 ---p 0000d000 68:02 896314 /lib64/libgcc_s-4.1.2-20080102.so.1 3849e0d000-3849e0e000 rw-p 0000d000 68:02 896314 /lib64/libgcc_s-4.1.2-20080102.so.1 2abb01fde000-2abb01fe0000 rw-p 2abb01fde000 00:00 0 2abb01fe0000-2abb02009000 r-xp 00000000 68:05 6072097 /usr/spool/PBS/lib/libtorque.so.2.0.0 2abb02009000-2abb02209000 ---p 00029000 68:05 6072097 /usr/spool/PBS/lib/libtorque.so.2.0.0 2abb02209000-2abb0220b000 rw-p 00029000 68:05 6072097 /usr/spool/PBS/lib/libtorque.so.2.0.0 2abb0220b000-2abb022ee000 rw-p 2abb0220b000 00:00 0 2abb0231a000-2abb0231b000 rw-p 2abb0231a000 00:00 0 2abb04000000-2abb04021000 rw-p 2abb04000000 00:00 0 2abb04021000-2abb08000000 ---p 2abb04021000 00:00 0 7fffa8ab6000-7fffa8acc000 rw-p 7fffa8ab6000 00:00 0 [stack] ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0 [vdso] Aborted. kindly suggest. -------------- next part -------------- An HTML attachment was scrubbed... URL: From samuel at unimelb.edu.au Wed Jun 16 22:05:33 2010 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Thu, 17 Jun 2010 15:05:33 +1000 Subject: [Beowulf] tracejob command shows error In-Reply-To: References: Message-ID: On Thu, 17 Jun 2010 11:47:18 am rigved sharma wrote: > *** glibc detected *** tracejob: malloc(): memory corruption: > 0x0000000003a74140 *** There was a bug reported that looked similar (though 32-bit) here: http://www.clusterresources.com/bugzilla/show_bug.cgi?id=49 Can I ask you to get a bugzilla account and add your problem to that one please ? The original reporter got rid of his install before we could track the problem down. For the record we're running 2.4.8 at VLSCI on CentOS 5.4 (x86-64) without any issues. cheers, Chris -- Christopher Samuel Senior Systems Administrator VLSCI - Victorian Life Sciences Computational Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From rchang.lists at gmail.com Thu Jun 17 09:31:18 2010 From: rchang.lists at gmail.com (Richard Chang) Date: Thu, 17 Jun 2010 22:01:18 +0530 Subject: [Beowulf] TEST - Pls ignore Message-ID: <4C1A4DD6.1030905@gmail.com> TEST From Craig.Tierney at noaa.gov Thu Jun 17 13:59:13 2010 From: Craig.Tierney at noaa.gov (Craig Tierney) Date: Thu, 17 Jun 2010 14:59:13 -0600 Subject: [Beowulf] Looking for block size settings (from stat) on parallel filesystems In-Reply-To: <4C1A4DD6.1030905@gmail.com> References: <4C1A4DD6.1030905@gmail.com> Message-ID: <592EB061-9728-4C70-B04F-96FA81DF9CD3@noaa.gov> I am looking for a little help to find out what block sizes (as shown by stat) by Linux based parallel filesystems. You can find this by running stat on a file. For example on Lustre: # stat /lfs0/bigfile File: `/lfs0//bigfile' Size: 1073741824 Blocks: 2097160 IO Block: 2097152 regular file Device: 59924a4a8h/1502839976d Inode: 45361266 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2010-06-17 20:24:32.000000000 +0000 Modify: 2010-06-17 20:16:49.000000000 +0000 Change: 2010-06-17 20:16:49.000000000 +0000 If anyone can run this test and provide me with the filesystem and result (as well as the OS used), it would be a big help. I am specifically looking for GPFS results, but other products (Panasas, GlusterFS, NetApp GX) would be helpful. Why do I care? Because in netcdf, when nf_open or nf_create are called, it will use the blocksize that is found in the stat structure. On lustre it is 2M so writes are very fast. However, if the number comes back as 4k (which some filesystems do), then writes are slower than they need to be. This isn't just a netcdf issue. The Linux tool cp does the same thing, it will use a block size that matches the specified blocksize of the destination filesystem. Thanks, Craig From jlb17 at duke.edu Thu Jun 17 14:35:47 2010 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Thu, 17 Jun 2010 17:35:47 -0400 (EDT) Subject: [Beowulf] Looking for block size settings (from stat) on parallel filesystems In-Reply-To: <592EB061-9728-4C70-B04F-96FA81DF9CD3@noaa.gov> References: <4C1A4DD6.1030905@gmail.com> <592EB061-9728-4C70-B04F-96FA81DF9CD3@noaa.gov> Message-ID: On Thu, 17 Jun 2010 at 2:59pm, Craig Tierney wrote > I am looking for a little help to find out what block sizes (as shown > by stat) by Linux based parallel filesystems. > > You can find this by running stat on a file. For example on Lustre: > > # stat /lfs0/bigfile > File: `/lfs0//bigfile' > Size: 1073741824 Blocks: 2097160 IO Block: 2097152 regular file > Device: 59924a4a8h/1502839976d Inode: 45361266 Links: 1 > Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) > Access: 2010-06-17 20:24:32.000000000 +0000 > Modify: 2010-06-17 20:16:49.000000000 +0000 > Change: 2010-06-17 20:16:49.000000000 +0000 > > If anyone can run this test and provide me with the filesystem > and result (as well as the OS used), it would be a big help. I am > specifically looking for GPFS results, but other products (Panasas, > GlusterFS, NetApp GX) would be helpful. 
GlusterFS 3.0.4 on CentOS-5: stat pdball.pir File: `pdball.pir' Size: 155471981 Blocks: 303984 IO Block: 4096 regular file Device: 21h/33d Inode: 205792080 Links: 1 Access: (0644/-rw-r--r--) Uid: (11805/database) Gid: (11805/database) Access: 2010-06-10 16:55:43.000000000 -0700 Modify: 2010-06-10 06:03:53.000000000 -0700 Change: 2010-06-10 23:36:14.000000000 -0700 -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF From samuel at unimelb.edu.au Thu Jun 17 16:58:50 2010 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Fri, 18 Jun 2010 09:58:50 +1000 Subject: [Beowulf] Looking for block size settings (from stat) on parallel filesystems In-Reply-To: <592EB061-9728-4C70-B04F-96FA81DF9CD3@noaa.gov> References: <4C1A4DD6.1030905@gmail.com> <592EB061-9728-4C70-B04F-96FA81DF9CD3@noaa.gov> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 18/06/10 06:59, Craig Tierney wrote: > If anyone can run this test and provide me with the > filesystem and result (as well as the OS used), it > would be a big help. I am specifically looking for > GPFS results, but other products (Panasas, GlusterFS, > NetApp GX) would be helpful. Our Panasas system says: File: `hwloc-1.0.1rc1.tar.bz2' Size: 1855126 Blocks: 4256 IO Block: 4096 regular file Device: 18h/24d Inode: 283485546 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 500/ samuel) Gid: ( 506/ vlsci) Access: 2010-06-03 11:29:54.613258023 +1000 Modify: 2010-05-29 01:24:56.000000000 +1000 Change: 2010-06-03 11:29:54.613258023 +1000 cheers, Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computational Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkwatroACgkQO2KABBYQAh+YCgCeIPo18p9HtjAVn9O5R89Xhm9b dRIAnROHlQj3WEeM6AbrSZ0fmiej7vBv =flxS -----END PGP SIGNATURE----- -------------- next part -------------- An HTML attachment was scrubbed... URL: From lindahl at pbm.com Thu Jun 17 18:29:44 2010 From: lindahl at pbm.com (Greg Lindahl) Date: Thu, 17 Jun 2010 18:29:44 -0700 Subject: [Beowulf] Looking for block size settings (from stat) on parallel filesystems In-Reply-To: <592EB061-9728-4C70-B04F-96FA81DF9CD3@noaa.gov> References: <4C1A4DD6.1030905@gmail.com> <592EB061-9728-4C70-B04F-96FA81DF9CD3@noaa.gov> Message-ID: <20100618012944.GD32585@bx9.net> On Thu, Jun 17, 2010 at 02:59:13PM -0600, Craig Tierney wrote: > Why do I care? Because in netcdf, when nf_open or nf_create are > called, it will use the blocksize that is found in the stat structure. On > lustre it is 2M so writes are very fast. However, if the number comes > back as 4k (which some filesystems do), then writes are slower than > they need to be. This isn't just a netcdf issue. The Linux tool cp does > the same thing, it will use a block size that matches the specified > blocksize of the destination filesystem. Craig, On-node filesystems merge writes in the guts of the block device system, so I wouldn't be surprised if 4k buffers and 2M buffers were about the same with ext3. To get an idea if this is the case with parallel filesystems, if people could measure the speed of dd with various blocksizes, that would tell you a better answer than just the blocksize. But, of course, you will run into the usual issue of write buffering. 
-- greg From kilian.cavalotti.work at gmail.com Fri Jun 18 01:08:36 2010 From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI) Date: Fri, 18 Jun 2010 10:08:36 +0200 Subject: [Beowulf] Looking for block size settings (from stat) on parallel filesystems In-Reply-To: <592EB061-9728-4C70-B04F-96FA81DF9CD3@noaa.gov> References: <4C1A4DD6.1030905@gmail.com> <592EB061-9728-4C70-B04F-96FA81DF9CD3@noaa.gov> Message-ID: Craig, On Thu, Jun 17, 2010 at 10:59 PM, Craig Tierney wrote: > If anyone can run this test and provide me with the filesystem > and result (as well as the OS used), it would be a big help. ?I am > specifically looking for GPFS results, but other products (Panasas, > GlusterFS, NetApp GX) would be helpful. GPFS 3.3 on RHEL5.5: # stat /gpfs/bigfile File: `/gpfs/bigfile' Size: 10737418240 Blocks: 20971520 IO Block: 1048576 regular file Device: 13h/19d Inode: 127073 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2010-06-18 10:04:10.555613000 +0200 Modify: 2010-06-18 10:04:58.764714000 +0200 Change: 2010-06-18 10:04:58.764714000 +0200 But block size is really something you can choose when creating the filesystem (mmcrfs -B). Cheers, -- Kilian From john.hearns at mclaren.com Fri Jun 18 04:30:12 2010 From: john.hearns at mclaren.com (Hearns, John) Date: Fri, 18 Jun 2010 12:30:12 +0100 Subject: [Beowulf] Turboboost/IDA on Nehalem Message-ID: <68A57CCFD4005646957BD2D18E60667B10CDC0DA@milexchmb1.mil.tagmclarengroup.com> Does anyone know much about Turboboost on Nehalem? I would like to have some indication that this is working, and perhaps measure what effect it has. I have enabled Turboboost in the BIOS, however when I modprobe acpi_cpufreq I get FATAL: Error inserting acpi_cpufreq (/lib/modules/2.6.31.12-0.2-desktop/kernel/arch/x86/kernel/cpu/cpufreq/a cpi-cpufreq.ko): No such device Dave Jones' blog suggest this is a BIOS issue. Any ideas? I have tried BIOS flags for hardware and software control of CPU stepping. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From douglas.guptill at dal.ca Fri Jun 18 05:27:20 2010 From: douglas.guptill at dal.ca (Douglas Guptill) Date: Fri, 18 Jun 2010 09:27:20 -0300 Subject: [Beowulf] Turboboost/IDA on Nehalem In-Reply-To: <68A57CCFD4005646957BD2D18E60667B10CDC0DA@milexchmb1.mil.tagmclarengroup.com> References: <68A57CCFD4005646957BD2D18E60667B10CDC0DA@milexchmb1.mil.tagmclarengroup.com> Message-ID: <20100618122720.GB1373@sopalepc> On Fri, Jun 18, 2010 at 12:30:12PM +0100, Hearns, John wrote: > Does anyone know much about Turboboost on Nehalem? > I would like to have some indication that this is working, and perhaps > measure what effect it has. > I have enabled Turboboost in the BIOS, however when I modprobe > acpi_cpufreq I get > > FATAL: Error inserting acpi_cpufreq > (/lib/modules/2.6.31.12-0.2-desktop/kernel/arch/x86/kernel/cpu/cpufreq/a > cpi-cpufreq.ko): No such device > > > Dave Jones' blog suggest this is a BIOS issue. Any ideas? > I have tried BIOS flags for hardware and software control of CPU > stepping. 
Here is what I get: scrum:~# uname -a Linux scrum 2.6.26-2-amd64 #1 SMP Wed May 12 18:03:14 UTC 2010 x86_64 GNU/Linux scrum:~# modprobe -nv acpi_cpufreq insmod /lib/modules/2.6.26-2-amd64/kernel/drivers/cpufreq/freq_table.ko insmod /lib/modules/2.6.26-2-amd64/kernel/arch/x86/kernel/cpu/cpufreq/acpi-cpu scrum:~# modprobe acpi_cpufreq scrum:~# Which seems to have worked. HTH, Douglas. P.S. `less /proc/cpuinfo` starts with: processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 26 model name : Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz stepping : 5 cpu MHz : 2668.000 cache size : 8192 KB physical id : 0 siblings : 8 core id : 0 cpu cores : 4 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm ida bogomips : 5349.70 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: -- Douglas Guptill voice: 902-461-9749 Research Assistant, LSC 4640 email: douglas.guptill at dal.ca Oceanography Department fax: 902-494-3877 Dalhousie University Halifax, NS, B3H 4J1, Canada From cap at nsc.liu.se Fri Jun 18 07:41:22 2010 From: cap at nsc.liu.se (Peter Kjellstrom) Date: Fri, 18 Jun 2010 16:41:22 +0200 Subject: [Beowulf] Turboboost/IDA on Nehalem In-Reply-To: <68A57CCFD4005646957BD2D18E60667B10CDC0DA@milexchmb1.mil.tagmclarengroup.com> References: <68A57CCFD4005646957BD2D18E60667B10CDC0DA@milexchmb1.mil.tagmclarengroup.com> Message-ID: <201006181641.26945.cap@nsc.liu.se> On Friday 18 June 2010, Hearns, John wrote: > Does anyone know much about Turboboost on Nehalem? > I would like to have some indication that this is working, and perhaps > measure what effect it has. > I have enabled Turboboost in the BIOS, however when I modprobe > acpi_cpufreq I get > > FATAL: Error inserting acpi_cpufreq > (/lib/modules/2.6.31.12-0.2-desktop/kernel/arch/x86/kernel/cpu/cpufreq/a > cpi-cpufreq.ko): No such device On stock CentOS-5 it looks like this (with dual E5520): [root at m1 ~]# uname -r 2.6.18-194.3.1.el5 [root at m1 ~]# /etc/init.d/cpuspeed start Enabling ondemand cpu frequency scaling: [ OK ] [root at m1 ~]# lsmod | grep acpi_cpufreq acpi_cpufreq 47937 0 [root at m1 ~]# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies 2268000 2267000 2133000 2000000 1867000 1733000 1600000 Turbo-boost is the +1MHz freq 2268000. When the governor goes looking for the highest available frequency the CPU goes into turbo-mode. The actual frequency then is determined by available power and thermal margins but ultimately limited depending on processor model (my E5520 can do one freq. step, that is, 2.26 -> 2.4 GHz. X55xx can do two). In my experience (E5520 and X5550) turbo mode works and gives you extra performance even when using all cores (HPC load). However, power consumption typically goes up a lot. /Peter -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part. 
URL: From eagles051387 at gmail.com Fri Jun 18 22:32:29 2010 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Sat, 19 Jun 2010 07:32:29 +0200 Subject: [Beowulf] Turboboost/IDA on Nehalem In-Reply-To: <201006181641.26945.cap@nsc.liu.se> References: <68A57CCFD4005646957BD2D18E60667B10CDC0DA@milexchmb1.mil.tagmclarengroup.com> <201006181641.26945.cap@nsc.liu.se> Message-ID: if im not mistaken the increase is about 600mHz regardless of the i7 model. feel free to correct me if im wrong -------------- next part -------------- An HTML attachment was scrubbed... URL: From rpnabar at gmail.com Sat Jun 19 08:24:13 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Sat, 19 Jun 2010 10:24:13 -0500 Subject: [Beowulf] Turboboost/IDA on Nehalem In-Reply-To: <20100618122720.GB1373@sopalepc> References: <68A57CCFD4005646957BD2D18E60667B10CDC0DA@milexchmb1.mil.tagmclarengroup.com> <20100618122720.GB1373@sopalepc> Message-ID: On Fri, Jun 18, 2010 at 7:27 AM, Douglas Guptill wrote: > On Fri, Jun 18, 2010 at 12:30:12PM +0100, Hearns, John wrote: >> Does anyone know much about Turboboost on Nehalem? >> I would like to have some indication that this is working, and perhaps >> measure what effect it has. >> I have enabled Turboboost in the BIOS, however when I modprobe >> acpi_cpufreq I get What's a good way to confirm if my procs are actually in a turbo state at a given point of time. It doesn't get reported back through the usual BIOS channels does it? -- Rahul From hahn at mcmaster.ca Sun Jun 20 13:43:42 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Sun, 20 Jun 2010 16:43:42 -0400 (EDT) Subject: [Beowulf] Turboboost/IDA on Nehalem In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B10CDC0DA@milexchmb1.mil.tagmclarengroup.com> <201006181641.26945.cap@nsc.liu.se> Message-ID: > if im not mistaken the increase is about 600mHz regardless of the i7 model. > feel free to correct me if im wrong no, it varies by model. From eagles051387 at gmail.com Sun Jun 20 14:02:48 2010 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Sun, 20 Jun 2010 23:02:48 +0200 Subject: [Beowulf] Turboboost/IDA on Nehalem In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B10CDC0DA@milexchmb1.mil.tagmclarengroup.com> <201006181641.26945.cap@nsc.liu.se> Message-ID: whats the range that it varies by? -------------- next part -------------- An HTML attachment was scrubbed... URL: From hahn at mcmaster.ca Sun Jun 20 14:09:16 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Sun, 20 Jun 2010 17:09:16 -0400 (EDT) Subject: [Beowulf] Turboboost/IDA on Nehalem In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B10CDC0DA@milexchmb1.mil.tagmclarengroup.com> <201006181641.26945.cap@nsc.liu.se> Message-ID: > whats the range that it varies by? http://ark.intel.com/MySearch.aspx?TBT=true but seriously, you could have found this with 3 clicks or so. From jamesb at loreland.org Mon Jun 21 02:37:43 2010 From: jamesb at loreland.org (James Braid) Date: Mon, 21 Jun 2010 10:37:43 +0100 Subject: [Beowulf] Turboboost/IDA on Nehalem In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B10CDC0DA@milexchmb1.mil.tagmclarengroup.com> <20100618122720.GB1373@sopalepc> Message-ID: On Sat, Jun 19, 2010 at 16:24, Rahul Nabar wrote: > On Fri, Jun 18, 2010 at 7:27 AM, Douglas Guptill wrote: >> On Fri, Jun 18, 2010 at 12:30:12PM +0100, Hearns, John wrote: >>> Does anyone know much about Turboboost on Nehalem? >>> I would like to have some indication that this is working, and perhaps >>> measure what effect it has. 
>>> I have enabled Turboboost in the BIOS, however when I modprobe >>> acpi_cpufreq I get > > What's a good way to confirm if ?my procs are actually in a turbo > state at a given point of time. It doesn't get reported back through > the usual BIOS channels does it? turbostat from pmtools is a great little tool for seeing this type of info. http://kernel.org/pub/linux/kernel/people/lenb/acpi/utils/ FWIW acpi_cpufreq doesn't load on any of our Nehalem based systems (HP, Dell, running on Fedora 12) - but turbo boost and all the other power efficiency features work fine as far as I can tell. From bs_lists at aakef.fastmail.fm Thu Jun 17 16:08:35 2010 From: bs_lists at aakef.fastmail.fm (Bernd Schubert) Date: Fri, 18 Jun 2010 01:08:35 +0200 Subject: [Beowulf] Looking for block size settings (from stat) on parallel filesystems In-Reply-To: <592EB061-9728-4C70-B04F-96FA81DF9CD3@noaa.gov> References: <4C1A4DD6.1030905@gmail.com> <592EB061-9728-4C70-B04F-96FA81DF9CD3@noaa.gov> Message-ID: <201006180108.35316.bs_lists@aakef.fastmail.fm> On Thursday 17 June 2010, Craig Tierney wrote: > I am looking for a little help to find out what block sizes (as shown > by stat) by Linux based parallel filesystems. > > You can find this by running stat on a file. For example on Lustre: > > # stat /lfs0/bigfile > File: `/lfs0//bigfile' > Size: 1073741824 Blocks: 2097160 IO Block: 2097152 regular file > Device: 59924a4a8h/1502839976d Inode: 45361266 Links: 1 > Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) > Access: 2010-06-17 20:24:32.000000000 +0000 > Modify: 2010-06-17 20:16:49.000000000 +0000 > Change: 2010-06-17 20:16:49.000000000 +0000 > > If anyone can run this test and provide me with the filesystem > and result (as well as the OS used), it would be a big help. I am > specifically looking for GPFS results, but other products (Panasas, > GlusterFS, NetApp GX) would be helpful. > > Why do I care? Because in netcdf, when nf_open or nf_create are > called, it will use the blocksize that is found in the stat structure. On > lustre it is 2M so writes are very fast. However, if the number comes > back as 4k (which some filesystems do), then writes are slower than > they need to be. This isn't just a netcdf issue. The Linux tool cp does > the same thing, it will use a block size that matches the specified > blocksize of the destination filesystem. Probably a bit hackish, but it would be very simple to write an overlay fuse filesystem, which would allow to modify that parameter. Unfortunately, we also would need to modify fuse, as current maximum through fuse are 128KB. Although it also would be easy to change those defines. However, I'm not sure if RedHat backported those patches to allow large IO sizes through fuse at all. If not, glusterfs on RedHat also only will send 4KB requests. Cheers, Bernd From robl at mcs.anl.gov Fri Jun 18 13:29:37 2010 From: robl at mcs.anl.gov (Rob Latham) Date: Fri, 18 Jun 2010 15:29:37 -0500 Subject: [Beowulf] [hpc-announce] Deadline extended: CFP: Workshop on Interfaces and Abstractions for Scientific Data Storage (IASDS 2010) Message-ID: <20100618202937.GJ4073@mcs.anl.gov> We have extended the deadline for submission to IASDS 2010 by one week CALL FOR PAPERS: IASDS 2010 (http://www.mcs.anl.gov/events/workshops/iasds10/) In conjunction with IEEE Cluster 2010 (http://www.cluster2010.org/) High-performance computing simulations and large scientific experiments such as those in high energy physics generate tens of terabytes of data, and these data sizes grow each year. 
Existing systems for storing, managing, and analyzing data are being pushed to their limits by these applications, and new techniques are necessary to enable efficient data processing for future simulations and experiments. This workshop will provide a forum for engineers and scientists to present and discuss their most recent work related to the storage, management, and analysis of data for scientific workloads. Emphasis will be placed on forward-looking approaches to tackle the challenges of storage at extreme scale or to provide better abstractions for use in scientific workloads. TOPICS OF INTEREST: Topics of interest include, but are not limited to: - parallel file systems - scientific databases - active storage - scientific I/O middleware - extreme scale storage PAPER SUBMISSION Workshop papers will be peer-reviewed and will appear as part of the IEEE Cluster 2010 proceedings. Submissions must follow the Cluster 2010 format: PDF files only. Maximum 10 pages. Single-spaced 8.5x11-inch, Two-column numbered pages in IEEE Xplore format IMPORTANT DATES: Paper Submission Deadline: now June 28, 2010 Author Notification: July 16, 2010 Final Manuscript: July 30, 2010 Workshop: September 24, 2010 PROGRAM COMMITTEE: Program Committee Robert Latham, Argonne National Laboratory Quincey Koziol, The HDF Group Pete Wyckoff, Netapp Wei-Keng Liao, Northwestern University Florin Isalia, Universidad Carlos III de Madrid Katie Antypas, NERSC Anshu Dubey, FLASH Dean Hildebrand, IBM Almaden Bradley Settlemyer, Oak Ridge National Laboratory -- Rob Latham Mathematics and Computer Science Division Argonne National Lab, IL USA From turnerg at indiana.edu Sat Jun 19 09:37:48 2010 From: turnerg at indiana.edu (George Wm Turner) Date: Sat, 19 Jun 2010 12:37:48 -0400 Subject: [Beowulf] Turboboost/IDA on Nehalem In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B10CDC0DA@milexchmb1.mil.tagmclarengroup.com> <20100618122720.GB1373@sopalepc> Message-ID: <365A08AC-E47F-4F10-A5A1-D9E5BA9AB5E0@indiana.edu> cat /proc/cpuinfo Look for a clock rate higher then the chip's rated clock speed. You must have cpuspeed enabled (re: redhat: service cpuspeed start) On multi core chips, turbomode comes into play when the chip is lighty loaded and the idle cores can be clocked down and that power divereted to the core(s) actually running code. On an idle system, you may notice that all the cpus in /proc/cpuinfo" say they're running at the higher clock speeds; it's an illusion; they ain't doin' nuttin. george wm turner high performance systems 812 855 5156 On Jun 19, 2010, at 11:24 AM, Rahul Nabar wrote: > On Fri, Jun 18, 2010 at 7:27 AM, Douglas Guptill > wrote: >> On Fri, Jun 18, 2010 at 12:30:12PM +0100, Hearns, John wrote: >>> Does anyone know much about Turboboost on Nehalem? >>> I would like to have some indication that this is working, and >>> perhaps >>> measure what effect it has. >>> I have enabled Turboboost in the BIOS, however when I modprobe >>> acpi_cpufreq I get > > What's a good way to confirm if my procs are actually in a turbo > state at a given point of time. It doesn't get reported back through > the usual BIOS channels does it? 
> > -- > Rahul > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From K.Weiss at science-computing.de Mon Jun 21 04:00:16 2010 From: K.Weiss at science-computing.de (Karsten Weiss) Date: Mon, 21 Jun 2010 13:00:16 +0200 (CEST) Subject: [Beowulf] Turboboost/IDA on Nehalem In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B10CDC0DA@milexchmb1.mil.tagmclarengroup.com> <20100618122720.GB1373@sopalepc> Message-ID: On Mon, 21 Jun 2010, James Braid wrote: > > What's a good way to confirm if ?my procs are actually in a turbo > > state at a given point of time. It doesn't get reported back through > > the usual BIOS channels does it? > > turbostat from pmtools is a great little tool for seeing this type of info. > > http://kernel.org/pub/linux/kernel/people/lenb/acpi/utils/ > > FWIW acpi_cpufreq doesn't load on any of our Nehalem based systems > (HP, Dell, running on Fedora 12) - but turbo boost and all the other > power efficiency features work fine as far as I can tell. Another alternative is i7z: http://code.google.com/p/i7z/ -- Karsten Weiss / science + computing ag From prentice at ias.edu Fri Jun 25 07:28:15 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Fri, 25 Jun 2010 10:28:15 -0400 Subject: [Beowulf] Peformance penalty when using 128-bit reals on AMD64 Message-ID: <4C24BCFF.1040007@ias.edu> Beowulfers, One of my Fortran programmers had to increase the precision of his program so he switched from REAL*8 to REAL*16 which changes the size of his variables from 64 bits to 128 bits. The program now takes 32x longer to run. I'm not an expert on processor archtitecture, etc., but I do know that once the size of a variable exceeds the size of the processors registers, things will slow down considerably. Is his 32x performance degradation in line with this? Is there any way to reduce this degradation? Would The GNU GMP library (or some other library) help speed things up? -- Prentice From landman at scalableinformatics.com Fri Jun 25 07:51:52 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Fri, 25 Jun 2010 10:51:52 -0400 Subject: [Beowulf] Peformance penalty when using 128-bit reals on AMD64 In-Reply-To: <4C24BCFF.1040007@ias.edu> References: <4C24BCFF.1040007@ias.edu> Message-ID: <4C24C288.5090604@scalableinformatics.com> Prentice Bisbal wrote: > Beowulfers, > > One of my Fortran programmers had to increase the precision of his > program so he switched from REAL*8 to REAL*16 which changes the size of > his variables from 64 bits to 128 bits. The program now takes 32x longer > to run. > > I'm not an expert on processor archtitecture, etc., but I do know that > once the size of a variable exceeds the size of the processors > registers, things will slow down considerably. Is his 32x performance > degradation in line with this? At least 4x more work is often the case. 32x doesn't sound unreasonable. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. 
email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From prentice at ias.edu Fri Jun 25 07:53:11 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Fri, 25 Jun 2010 10:53:11 -0400 Subject: [Beowulf] Thread hijacking In-Reply-To: <4B352F20504D274FBEB616B2247487F7C25C15@exchange2003.highdata.com> References: <4C24BCFF.1040007@ias.edu> <4B352F20504D274FBEB616B2247487F7C25C15@exchange2003.highdata.com> Message-ID: <4C24C2D7.20003@ias.edu> Jaime, If you're going to post something like this to our mailing list, please do not reply to someone else's email on a different topic. This is known as thread hijacking, and screws up the flow of the conversation, especially in the mailing list archives. It's bad etiquette. Since this position is for someone with cluster skills, it is appropriate to post it here - as it's own thread. Prentice Jaime Biggs wrote: > Hello, > > I was wondering if anyone had any interest in this opportunity? If not > we can pay you $1,000 if we can place the person you refer to me. > > > > Location: Cambridge, MA > Duration: 3-6 months possible long term > Senior Support Analyst and Beowulf Cluster Administrator > > From Manager: > > This is the profile I need. Will be working on a Centrfy project (synchs > unix accounts with Active Directory), completing the set up of the new > linux cluster using Cluster File Services for connection to the storage > appliance and Sun Grid Engine as the cluster management application, and > creating a VM server and installing PipelinePilot for one research > group. > > KEYS: Worked with USERS, Managed a Cluster (Linux-Sungard), and Linux. > Centrfy project (synchs unix accounts with Active Directory), completing > the set up of the new linux cluster using Cluster File Services for > connection to the storage appliance and Sun Grid Engine as the cluster > management application, and creating a VM server and installing > PipelinePilot for one research group. > > *5 to 10 years in system administration and application support in a > Unix/Linux environment > *3 to 5 years experience in Shell scripting (tcsh and bash), TCP/IP, > NFS, CIFS (SMB), PAM, Apache, JBoss, DHCP and SystemImager. > *2 to 3 years experience managing a high availability Linux cluster; > experience with Cluster File Systems, Beowulf parallel clusters, Moab, > Torque/PBS, MPI and OS upgrades or equivalent products/environments > *Working knowledge of Perl, network information directories (like NIS, > Active Directory, LDAP, K), PHP, Java, gnome, KDE > *Experience in configuring Beowulf clusters using Sun Grid Engine > > > Best Regards, > > Jaime Biggs > Director of Recruitment > Ivesia Solutions, Inc. > 2 Keewaydin Dr. > Salem, NH 03079 > tel: 800-871-1510 x 2407 > tel: 603-685-2407 > fax: 603-890-1276 > jbiggs at ivesia.com > www.ivesia.com > > Jaime Biggs > Director of Recruitment > Ivesia Solutions, Inc. > 2 Keewaydin Dr. 
> Salem, NH 03079 > tel: 800-871-1510 x 2407 > tel: 603-685-2407 > fax: 603-890-1276 > jbiggs at ivesia.com > www.ivesia.com > > > -----Original Message----- > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] > On Behalf Of Prentice Bisbal > Sent: Friday, June 25, 2010 10:28 AM > To: Beowulf Mailing List > Subject: [Beowulf] Peformance penalty when using 128-bit reals on AMD64 > > Beowulfers, > > One of my Fortran programmers had to increase the precision of his > program so he switched from REAL*8 to REAL*16 which changes the size of > his variables from 64 bits to 128 bits. The program now takes 32x longer > to run. > > I'm not an expert on processor archtitecture, etc., but I do know that > once the size of a variable exceeds the size of the processors > registers, things will slow down considerably. Is his 32x performance > degradation in line with this? > > Is there any way to reduce this degradation? Would The GNU GMP library > (or some other library) help speed things up? > > -- Prentice Bisbal Linux Software Support Specialist/System Administrator School of Natural Sciences Institute for Advanced Study Princeton, NJ From ljdursi at scinet.utoronto.ca Fri Jun 25 08:13:20 2010 From: ljdursi at scinet.utoronto.ca (Jonathan Dursi) Date: Fri, 25 Jun 2010 11:13:20 -0400 Subject: [Beowulf] Peformance penalty when using 128-bit reals on AMD64 In-Reply-To: <4C24BCFF.1040007@ias.edu> References: <4C24BCFF.1040007@ias.edu> Message-ID: <28872C7E-BD70-4E68-906E-6E3A3342D98B@scinet.utoronto.ca> I can certainly imagine 2-8x slowdown; 4x for say multiplication (I believe AMD64 doesn't support quad-precision in hardware, so everything has to be emulated) and 2x for the extra memory bandwidth. 32x seems harsh, but isn't obviously crazy. This sounds a lot like blindly using a sledgehammer, though. If the user absolutely requires quad-precision everywhere because they need precision everywhere in their calculation better than one part in 1e16, then they're basically just doomed; but there are very few applications in that regime. Likely there's some part of their problem which is particularly sensitive to the numerics (or they're just using crappy numerics everywhere). One nice thing about the flurry of GPGPU activity is that it's inspired a resurgence of interest in `mixed precision algorithms', where parts of the numerics are implemented (or emulated) at very high precision, and others are implemented at lower precision. It might be worth googling around a bit for their particular problem to see if people have implemented that sort of approach for their particular problem. Of course if they really really need quad precision they should find an architecture (the Power series) that supports quad precision in hardware; but they'll always end up having to pay the 2x memory bandwidth penalty, no way around that. The Gnu GMP, which is very cute and well implemented, is definitely not a way to make things go *faster*. It may well be faster than the other arbitrary-precision libraries out there, but I would expect it to be slower than (fixed) quad precision. On the other hand, if there's only a small portion of the code that needs that approach and the rest can be done in double, there may not be a huge speed penalty. 
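A quick way to see how much of such a slowdown comes from the emulated
arithmetic alone is to time the same toy kernel at both precisions. The
sketch below is only illustrative: it assumes the Intel compiler (ifort),
where the kind numbers 8 and 16 match the byte sizes and REAL(16) is done
in software; with another compiler the kinds should come from
selected_real_kind() rather than being hard-coded.

#!/bin/bash
# Rough timing of one toy kernel built with REAL*8 and then REAL*16.
cat > kernel.f90 <<'EOF'
program kernel
  implicit none
  integer, parameter :: wp = WPKIND
  real(wp) :: x, s
  integer :: i
  s = 0.0_wp
  x = 1.0000000001_wp
  do i = 1, 20000000
     s = s + x*x - s/x        ! dependent multiply/divide chain
     x = x + 1.0e-10_wp
  end do
  print *, 'result = ', s     ! keep the loop from being optimized away
end program kernel
EOF

for kind in 8 16; do
    sed "s/WPKIND/${kind}/" kernel.f90 > kernel_${kind}.f90
    ifort -O2 -o kernel_${kind} kernel_${kind}.f90
    echo "== REAL*${kind} =="
    time ./kernel_${kind}
done

The ratio of the two timings gives a feel for how much of the reported
32x is raw arithmetic emulation, as opposed to the extra memory traffic
or library calls on top of it.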
Jonathan -- Jonathan Dursi From niftyompi at niftyegg.com Fri Jun 25 08:54:24 2010 From: niftyompi at niftyegg.com (Nifty Tom Mitchell) Date: Fri, 25 Jun 2010 08:54:24 -0700 Subject: [Beowulf] Peformance penalty when using 128-bit reals on AMD64 In-Reply-To: <4C24BCFF.1040007@ias.edu> References: <4C24BCFF.1040007@ias.edu> Message-ID: <20100625155424.GB2263@tosh2egg.ca.sanfran.comcast.net> On Fri, Jun 25, 2010 at 10:28:15AM -0400, Prentice Bisbal wrote: > > Beowulfers, > > One of my Fortran programmers had to increase the precision of his > program so he switched from REAL*8 to REAL*16 which changes the size of > his variables from 64 bits to 128 bits. The program now takes 32x longer > to run. > I am surprised that it works as support in things like the math lib, log and trig functions could be missing. Which compiler is he using? -- T o m M i t c h e l l Found me a new hat, now what? From ntmoore at gmail.com Fri Jun 25 10:30:39 2010 From: ntmoore at gmail.com (Nathan Moore) Date: Fri, 25 Jun 2010 12:30:39 -0500 Subject: [Beowulf] Peformance penalty when using 128-bit reals on AMD64 In-Reply-To: <20100625155424.GB2263@tosh2egg.ca.sanfran.comcast.net> References: <4C24BCFF.1040007@ias.edu> <20100625155424.GB2263@tosh2egg.ca.sanfran.comcast.net> Message-ID: I agree, it seems odd that the OS/compiler has a 128 bit math library available. I certainly hope that your seemingly correct answer is not being corrupted when you compute a 64-bit sine of a 128 bit number... I used GMP's arbitrary precision library (rational number arithmetic) for my thesis a few years back. It was very easy to implement, but not fast (better on x86 hardware than sun/sgi/power as I recall). I too am curious about the sort of algorithm that would require that much precision. (for me, it was inverting a probability distribution that was piecewise-defined, described here, http://arxiv.org/abs/cond-mat/0506786, sorry, nobody ever gets to talk about their thesis...). Nathan On Fri, Jun 25, 2010 at 10:54 AM, Nifty Tom Mitchell wrote: > On Fri, Jun 25, 2010 at 10:28:15AM -0400, Prentice Bisbal wrote: >> >> Beowulfers, >> >> One of my Fortran programmers had to increase the precision of his >> program so he switched from REAL*8 to REAL*16 which changes the size of >> his variables from 64 bits to 128 bits. The program now takes 32x longer >> to run. >> > > I am surprised that it works as support in things like > the math lib, log and trig functions could be missing. > Which compiler is he using? > > > > > > -- > ? ? ? ?T o m ?M i t c h e l l > ? ? ? ?Found me a new hat, now what? 
> > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- - - - - - - - - - - - - - - - - - - - - - Nathan Moore Assistant Professor, Physics Winona State University AIM: nmoorewsu - - - - - - - - - - - - - - - - - - - - - From prentice at ias.edu Fri Jun 25 10:35:52 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Fri, 25 Jun 2010 13:35:52 -0400 Subject: [Beowulf] Peformance penalty when using 128-bit reals on AMD64 In-Reply-To: <20100625155424.GB2263@tosh2egg.ca.sanfran.comcast.net> References: <4C24BCFF.1040007@ias.edu> <20100625155424.GB2263@tosh2egg.ca.sanfran.comcast.net> Message-ID: <4C24E8F8.5050203@ias.edu> Nifty Tom Mitchell wrote: > On Fri, Jun 25, 2010 at 10:28:15AM -0400, Prentice Bisbal wrote: >> Beowulfers, >> >> One of my Fortran programmers had to increase the precision of his >> program so he switched from REAL*8 to REAL*16 which changes the size of >> his variables from 64 bits to 128 bits. The program now takes 32x longer >> to run. >> > > I am surprised that it works as support in things like > the math lib, log and trig functions could be missing. > Which compiler is he using? > I'm 99% sure he's using the Intel Fortran Compiler, ifort. -- Prentice From prentice at ias.edu Fri Jun 25 10:57:41 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Fri, 25 Jun 2010 13:57:41 -0400 Subject: [Beowulf] Peformance penalty when using 128-bit reals on AMD64 In-Reply-To: References: <4C24BCFF.1040007@ias.edu> <20100625155424.GB2263@tosh2egg.ca.sanfran.comcast.net> Message-ID: <4C24EE15.5030201@ias.edu> This may be naive, but I assumed that if the language supports real*16, then the language and compiler would have to support all of the functions that are native to the language (add, subtract, and whatever else the Fortran standard specifies), would have to handle real*16 operands as well. Is that a correct assumption? I can see how more complicated functions and libraries built off the simpler functions would be a problem. I forwarded some of the e-mails from this discussion to the programmer, so he'd understand the possible issues. I don't know exactly what problem this person is tackling with this program, but I can say he is a theoretical physicist whose research includes quantum mechanics and field theory. Prentice Nathan Moore wrote: > I agree, it seems odd that the OS/compiler has a 128 bit math library > available. I certainly hope that your seemingly correct answer is not > being corrupted when you compute a 64-bit sine of a 128 bit number... > > I used GMP's arbitrary precision library (rational number arithmetic) > for my thesis a few years back. It was very easy to implement, but > not fast (better on x86 hardware than sun/sgi/power as I recall). I > too am curious about the sort of algorithm that would require that > much precision. (for me, it was inverting a probability distribution > that was piecewise-defined, described here, > http://arxiv.org/abs/cond-mat/0506786, sorry, nobody ever gets to talk > about their thesis...). 
> > Nathan > > > > On Fri, Jun 25, 2010 at 10:54 AM, Nifty Tom Mitchell > wrote: >> On Fri, Jun 25, 2010 at 10:28:15AM -0400, Prentice Bisbal wrote: >>> Beowulfers, >>> >>> One of my Fortran programmers had to increase the precision of his >>> program so he switched from REAL*8 to REAL*16 which changes the size of >>> his variables from 64 bits to 128 bits. The program now takes 32x longer >>> to run. >>> >> I am surprised that it works as support in things like >> the math lib, log and trig functions could be missing. >> Which compiler is he using? >> >> >> >> >> >> -- >> T o m M i t c h e l l >> Found me a new hat, now what? >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf >> > > > -- Prentice Bisbal Linux Software Support Specialist/System Administrator School of Natural Sciences Institute for Advanced Study Princeton, NJ From mathog at caltech.edu Fri Jun 25 11:26:20 2010 From: mathog at caltech.edu (David Mathog) Date: Fri, 25 Jun 2010 11:26:20 -0700 Subject: [Beowulf] Re: Peformance penalty when using 128-bit reals on AMD64 Message-ID: Prentice Bisbal wrote: > One of my Fortran programmers had to increase the precision of his > program so he switched from REAL*8 to REAL*16 which changes the size of > his variables from 64 bits to 128 bits. The program now takes 32x longer > to run. Doubling the size of the variables can increase the size of the arrays that hold them so that what once fit comfortably into the fastest parts of memory no longer does. Depending on memory access patterns that can easily result in a 16X drop in speed. (2X would normally be lost either way because there is twice as much data to move.) You didn't say why he had to change the precision. If it was a numerical stability issue, well, if the algorithm doesn't work for R*8 going to R*16 may not be a reliable way to calm things down. If this is a case where the exponents are out of range perhaps the whole problem can be scaled up or down by some constant factor so that the numbers once again fit into the range of exponents supported by R*8? Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From glykos at mbg.duth.gr Fri Jun 25 15:04:38 2010 From: glykos at mbg.duth.gr (Nicholas M Glykos) Date: Sat, 26 Jun 2010 01:04:38 +0300 (EEST) Subject: [Beowulf] Peformance penalty when using 128-bit reals on AMD64 In-Reply-To: References: <4C24BCFF.1040007@ias.edu> <20100625155424.GB2263@tosh2egg.ca.sanfran.comcast.net> Message-ID: > http://arxiv.org/abs/cond-mat/0506786, sorry, nobody ever gets to talk > about their thesis...). :-)) (sorry, sorry, I couldn't resist the temptation). -- Dr Nicholas M. Glykos, Department of Molecular Biology and Genetics, Democritus University of Thrace, University Campus, Dragana, 68100 Alexandroupolis, Greece, Tel/Fax (office) +302551030620, Ext.77620, Tel (lab) +302551030615, http://utopia.duth.gr/~glykos/ From derekr42 at gmail.com Fri Jun 25 09:01:34 2010 From: derekr42 at gmail.com (Derek R.) 
Date: Fri, 25 Jun 2010 11:01:34 -0500 Subject: [Beowulf] Peformance penalty when using 128-bit reals on AMD64 In-Reply-To: <4C24BCFF.1040007@ias.edu> References: <4C24BCFF.1040007@ias.edu> Message-ID: Prentice, As was said before, I don't believe that x64 processor architectures support 128 bit precision instructions either (I did glance through the official AMD manuals, and I've read the first 3 in the set for another project, and I can't recall anything about operating on variables that large; storing values of that precision, yes, but not multiplying and storing the results in registers). The results would overflow the registers and then you'd have to fall back on cache (which could be entirely doable, but you'd have to code in assembler to ensure that (a) the results don't fall out of cache and (b) that you are fetching the proper cache lines to obtain your results) or main memory (which would once again involve coding in assembly language). One way I think you might be able to do this is via some of the SIMD multimedia instructions built into the processor. I only gave that volume of the x64 (x86_64, AMD64, tomato-vs-tomato) manuals a cursory glance as that's never been my concern, but I do believe that the processor architecture does indeed support that level of precision and has the instructions to store the rather large results in contiguous registers. Of course, I don't know what this would do to your code. I'd suggest 4 things : 1) Order a set of the AMD64 manuals (they used to be free, not sure now) from AMD 2) Look at a cheap, brute force solution - I'd suggest SSD disks for swap, perhaps (that's the most likely way I can think of the performance degradation you're seeing happening - going out to swap - it's easy and cheap to test on one system, and if it reduces it to a more acceptable wall clock time then see if you can live with that) 3) Find a project that utilizes the CPU's performance counters and measure exactly what is happening - it could be something quite simple that the compiler is doing wrong and you can fix w/ a few flags or a little bit of inline assembly code (I'm no FORTRAN programmer, but whatever standard you're using should support it if the compiler does, and most of them do)...I haven't done this in quite a while, perfctr used to be the standard. What's the current Linux best-practice standard? 4) Start investigating other solutions in terms of CPU/GPU solutions (if it's that important) That's my $0.02 USD that I can add to this discussion on very little sleep, I'll mail you if further inspiration hits with more espresso. I hope it helps. And I can't really comment of the feasibility of GMP libraries as I've never used them. Regards, Derek R. On Fri, Jun 25, 2010 at 9:28 AM, Prentice Bisbal wrote: > Beowulfers, > > One of my Fortran programmers had to increase the precision of his > program so he switched from REAL*8 to REAL*16 which changes the size of > his variables from 64 bits to 128 bits. The program now takes 32x longer > to run. > > I'm not an expert on processor archtitecture, etc., but I do know that > once the size of a variable exceeds the size of the processors > registers, things will slow down considerably. Is his 32x performance > degradation in line with this? > > Is there any way to reduce this degradation? Would The GNU GMP library > (or some other library) help speed things up? 
> > > -- > Prentice > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From amjad11 at gmail.com Sat Jun 26 13:12:15 2010 From: amjad11 at gmail.com (amjad ali) Date: Sat, 26 Jun 2010 16:12:15 -0400 Subject: [Beowulf] MPI Persistent Comm Question Message-ID: Hi all, What is the be the best way of using MPI persistent communication in an iterative/repetative kind of code about calling MPI_Free(); Should we call MPI_Free() in every iteration or only once when all the iterations/repetitions are performed? Means which one is the best out of following two: (1) Call this subroutines 1000 times ============================= call MPI_RECV_Init() call MPI_Send_Init() call MPI_Startall() call MPI_Free() ============================= (2) Call this subroutines 1000 times =========================== call MPI_RECV_Init() call MPI_Send_Init() call MPI_Startall() ========================== call MPI_Free() --------- call it only once at the end. Thanks in advance. best regards AA -------------- next part -------------- An HTML attachment was scrubbed... URL: From niftyompi at niftyegg.com Sat Jun 26 15:40:24 2010 From: niftyompi at niftyegg.com (NiftyOMPI Tom Mitchell) Date: Sat, 26 Jun 2010 15:40:24 -0700 Subject: [Beowulf] Peformance penalty when using 128-bit reals on AMD64 In-Reply-To: References: <4C24BCFF.1040007@ias.edu> <20100625155424.GB2263@tosh2egg.ca.sanfran.comcast.net> Message-ID: On Fri, Jun 25, 2010 at 10:30 AM, Nathan Moore wrote: ...snip... > I used GMP's arbitrary precision library (rational number arithmetic) > for my thesis a few years back. ?It was very easy to implement, but > not fast (better on x86 hardware than sun/sgi/power as I recall). ?I > too am curious about the sort of algorithm that would require that > much precision. A lot can depend on the dynamic range of the values being operated on. If there is a mix of very large and very small values odd results can surface especially in parallel code. Also basic statistics where the squared values can sometimes unexpectedly overflow a computation when the "input" is well within range. It does make a lot of sense to test code and data with 32 and 64 bit floating point to see if odd results surface. It would be nice if systems+compilers had the option of 128 and even 256 bit operations so code could be tested for sensitivity that matters. I sort of wish precision was universally an application ./configure or a #define and while I am dreaming, 128 and 256 bit versions would just run... A +24x slower run would validate a lot of codes and critical runs. In a 30 second scan of GMP's arbitrary precision library I cannot tell if 32 and 64bit sizes fall out as equal in performance to native types. 
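Standard Fortran kinds get you part of the way to that configure-time wish. A sketch (module and routine names invented here, and the 128-bit kind only exists on compilers that provide one) is to route every declaration through a single working-precision constant, so a whole-code precision sweep is one edit and a rebuild:

  module working_precision
    implicit none
    ! Pick one kind as the working precision and rebuild to re-test the code.
    ! selected_real_kind returns -1 if the compiler has no such kind, so the
    ! qp line may have to go on compilers without a 128-bit real.
    integer, parameter :: sp = selected_real_kind(6,   37)   ! ~32-bit
    integer, parameter :: dp = selected_real_kind(15, 307)   ! ~64-bit
    integer, parameter :: qp = selected_real_kind(33, 4931)  ! ~128-bit
    integer, parameter :: wp = dp      ! <-- the only line that ever changes
  end module working_precision

  subroutine axpy(n, a, x, y)
    use working_precision, only: wp
    implicit none
    integer,  intent(in)    :: n
    real(wp), intent(in)    :: a, x(n)
    real(wp), intent(inout) :: y(n)
    y = a*x + y          ! every literal and local elsewhere uses _wp as well
  end subroutine axpy

Flipping wp to qp (or sp) then gives exactly that slower-but-validating run without touching the numerics themselves.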
-- NiftyOMPI T o m M i t c h e l l From lindahl at pbm.com Sat Jun 26 17:39:34 2010 From: lindahl at pbm.com (Greg Lindahl) Date: Sat, 26 Jun 2010 17:39:34 -0700 Subject: [Beowulf] Peformance penalty when using 128-bit reals on AMD64 In-Reply-To: References: <4C24BCFF.1040007@ias.edu> <20100625155424.GB2263@tosh2egg.ca.sanfran.comcast.net> Message-ID: <20100627003934.GJ21079@bx9.net> On Sat, Jun 26, 2010 at 03:40:24PM -0700, NiftyOMPI Tom Mitchell wrote: > In a 30 second scan of GMP's arbitrary precision library I cannot tell > if 32 and 64bit sizes fall out as equal in performance to native types. No. It's great for arbitrary large sizes and not so good for 128 bits, compared to a library that does only 128 bits. It makes no attempt to do 64 or 32 bit stuff using native types. -- greg From Bill.Rankin at sas.com Mon Jun 28 09:05:12 2010 From: Bill.Rankin at sas.com (Bill Rankin) Date: Mon, 28 Jun 2010 16:05:12 +0000 Subject: [Beowulf] MPI Persistent Comm Question In-Reply-To: References: Message-ID: <76097BB0C025054786EFAB631C4A2E3C0932CF8F@MERCMBX04R.na.SAS.com> Uhmmm, what is "MPI_Free()"? It does not appear to be part of the MPI2 standard. MPI_Free_mem() is part of the standard, but is used in conjunction with MPI_Alloc_mem() and doesn't seem to refer to what you are describing here. Is it a local procedure and if so, what does it do? -bill From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of amjad ali Sent: Saturday, June 26, 2010 4:12 PM To: Beowulf Mailing List Subject: [Beowulf] MPI Persistent Comm Question Hi all, What is the be the best way of using MPI persistent communication in an iterative/repetative kind of code about calling MPI_Free(); Should we call MPI_Free() in every iteration or only once when all the iterations/repetitions are performed? Means which one is the best out of following two: (1) Call this subroutines 1000 times ============================= call MPI_RECV_Init() call MPI_Send_Init() call MPI_Startall() call MPI_Free() ============================= (2) Call this subroutines 1000 times =========================== call MPI_RECV_Init() call MPI_Send_Init() call MPI_Startall() ========================== call MPI_Free() --------- call it only once at the end. Thanks in advance. best regards AA -------------- next part -------------- An HTML attachment was scrubbed... URL: From hahn at mcmaster.ca Mon Jun 28 12:40:26 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon, 28 Jun 2010 15:40:26 -0400 (EDT) Subject: [Beowulf] MPI Persistent Comm Question In-Reply-To: <76097BB0C025054786EFAB631C4A2E3C0932CF8F@MERCMBX04R.na.SAS.com> References: <76097BB0C025054786EFAB631C4A2E3C0932CF8F@MERCMBX04R.na.SAS.com> Message-ID: > Uhmmm, what is "MPI_Free()"? he probably meant MPI_Request_free > (1) > Call this subroutines 1000 times > ============================= > call MPI_RECV_Init() > call MPI_Send_Init() > call MPI_Startall() > call MPI_Free() > ============================= > > (2) > Call this subroutines 1000 times > =========================== > call MPI_RECV_Init() > call MPI_Send_Init() > call MPI_Startall() > ========================== > call MPI_Free() --------- call it only once at the end. I've never even seen these MPI_Start-related interfaces in use, but MPI_Request_free appears to be callable once per MPI_*_Init. that would argue for pairing as in the first sequence. afaikt, the point of the interface is actually init/start+/free - that is, you set up a persistent send and kick it into action many times. 
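In sketch form (the ring-neighbour exchange, buffer size and tag are invented for illustration; the calls themselves are the standard persistent-request interface), that usage looks like:

  program persistent_ring
    use mpi
    implicit none
    integer, parameter :: n = 1000
    double precision   :: sendbuf(n), recvbuf(n)
    integer :: ierr, rank, nprocs, left, right, iter
    integer :: reqs(2), stats(MPI_STATUS_SIZE, 2)

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
    right = mod(rank + 1, nprocs)
    left  = mod(rank - 1 + nprocs, nprocs)

    ! create the persistent requests ONCE, before the iteration loop
    call MPI_Send_init(sendbuf, n, MPI_DOUBLE_PRECISION, right, 0, &
                       MPI_COMM_WORLD, reqs(1), ierr)
    call MPI_Recv_init(recvbuf, n, MPI_DOUBLE_PRECISION, left,  0, &
                       MPI_COMM_WORLD, reqs(2), ierr)

    do iter = 1, 1000
       sendbuf = dble(iter)                    ! refresh the outgoing data
       call MPI_Startall(2, reqs, ierr)        ! (re)activate both requests
       call MPI_Waitall(2, reqs, stats, ierr)  ! complete them; they stay allocated
    end do

    ! release the persistent requests ONCE, after the last iteration
    call MPI_Request_free(reqs(1), ierr)
    call MPI_Request_free(reqs(2), ierr)
    call MPI_Finalize(ierr)
  end program persistent_ring

In other words, you kick the same requests off on every pass through the loop,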
but only free it once. From rpnabar at gmail.com Tue Jun 29 13:26:20 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 29 Jun 2010 15:26:20 -0500 Subject: [Beowulf] Survey how industrial companies use their HPC Resources In-Reply-To: <814583.50541.qm@web30602.mail.mud.yahoo.com> References: <20100615201226.GF21791@bx9.net> <814583.50541.qm@web30602.mail.mud.yahoo.com> Message-ID: On Wed, Jun 16, 2010 at 7:08 PM, Buccaneer for Hire. wrote: > --- On Tue, 6/15/10, Greg Lindahl wrote: > >> From: Greg Lindahl >> >> I noticed that the country list doesn't include the USA. >> > I just made the assumption if you left it blank that it was understood. :) I noticed that under interconnect options there was no 10GigE listed. Are any others on the list using this or am I the only one. -- Rahul From rpnabar at gmail.com Tue Jun 29 13:37:47 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 29 Jun 2010 15:37:47 -0500 Subject: [Beowulf] Re: Bugfix for Broadcom NICs losing connectivity In-Reply-To: <4C08BBCA.5070508@diamond.ac.uk> References: <201005251900.o4PJ0ElP016422@bluewest.scyld.com> <20100525194056.GB16022@kaizen.mayo.edu> <4C08BBCA.5070508@diamond.ac.uk> Message-ID: On Fri, Jun 4, 2010 at 3:39 AM, Tina Friedrich wrote: > We've had that happen on some of our servers. Currently using the > disable_msi workaround, which seems to have stopped it. I believe there's > supposed to be a fix in the latest Red Hat kernel but we haven't really > tested that yet. I saw the exact same symptoms as Tina. Not a hard-correlation but I mostly saw it during periods of high NFS loads. The disable_msi workaround does work like a charm. Before that my only option was to log in locally via console and then do a ifdown; ifup on the interface. -- Rahul From rpnabar at gmail.com Tue Jun 29 22:30:12 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 30 Jun 2010 00:30:12 -0500 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? Message-ID: The Top500 list has many useful metrics but I didn't see any $$ based metrics there. Are there any lists that document the $-per-teraflop (apologies to international members!) of any of the systems in the Supercomputer / Beowulf world? Googling "dollars per teraflop" didn't give me anything useful. I'm speculating, one reason could be that sites are loath to disclose their exact $ purchase prices etc. But on the other hand for most of the publicly owned systems this should be accessible information anyways. I was just thinking that this might be an interesting parameter to track. I was also curious as to when systems become larger is there an economy of scale in the Beowulf world? i.e. for something like Jaguar or Kraken is the $/teraflop much lower than what it is for my tiny 100-node system. Another question could be: Is it cheaper to assemble 100 Teraflops of capacity in the US or WU or China etc. Of course, HPC is not really commoditized so a Teraflop based $ value may not be strictly an apples-to-apples comparison but still..... Just wondering what statistics are available out there...... -- Rahul From lindahl at pbm.com Tue Jun 29 22:50:26 2010 From: lindahl at pbm.com (Greg Lindahl) Date: Tue, 29 Jun 2010 22:50:26 -0700 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: References: Message-ID: <20100630055026.GF28068@bx9.net> On Wed, Jun 30, 2010 at 12:30:12AM -0500, Rahul Nabar wrote: > The Top500 list has many useful metrics but I didn't see any $$ based > metrics there. 
Other communities with $$-based metrics haven't had much success with them. In HPC, many contracts are multi-year, multi-delivery, or, they include significant extra stuff beyond the iron. -- greg From prentice at ias.edu Wed Jun 30 05:45:17 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Wed, 30 Jun 2010 08:45:17 -0400 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: References: Message-ID: <4C2B3C5D.9030405@ias.edu> Even in publicly-owned computers, I'm sure the exact price of the cluster is well hidden. As Greg said, the cost might be buried in multi-year contract or as part of a larger contract. These large supercomputers at national labs or large universities are often provided by the vendor for little or no profit (maybe even a loss) in exchange for prestige/advertising opportunities or R&D opportunities. They make up this loss by selling to money-making corporations for a much bigger margin. For example, I would not be surprised if IBM practically gave away RoadRunner to Los Alamos in exchange for the computing expertise at Los Alamos to help develop such an architecture and then be able to say that IBM builds the world's fastest computers (and that your company can have one just like it, for a price). Oh, and the users at Los Alamos probably provide lots of feedback to IBM which helps them build better systems in the future. (Don't shoot me if I'm wrong. I'm just theorizing here) I used to work at the Princeton Plasma Physics Lab (www.pppl.gov), a Dept of Energy National Lab, and I can tell you many of the the systems sold to PPPL when I worked there was sold under a NDA, preventing anyone from discussing the price. Yes, the budgets are public information, but that tells you the computer hardware budget for a year, not how much was spent on each computer. Prentice Rahul Nabar wrote: > The Top500 list has many useful metrics but I didn't see any $$ based > metrics there. Are there any lists that document the $-per-teraflop > (apologies to international members!) of any of the systems in the > Supercomputer / Beowulf world? Googling "dollars per teraflop" didn't > give me anything useful. > > I'm speculating, one reason could be that sites are loath to disclose > their exact $ purchase prices etc. But on the other hand for most of > the publicly owned systems this should be accessible information > anyways. I was just thinking that this might be an interesting > parameter to track. I was also curious as to when systems become > larger is there an economy of scale in the Beowulf world? i.e. for > something like Jaguar or Kraken is the $/teraflop much lower than what > it is for my tiny 100-node system. Another question could be: Is it > cheaper to assemble 100 Teraflops of capacity in the US or WU or China > etc. > > Of course, HPC is not really commoditized so a Teraflop based $ value > may not be strictly an apples-to-apples comparison but still..... > > Just wondering what statistics are available out there...... > -- Prentice From joshua_mora at usa.net Wed Jun 30 06:10:47 2010 From: joshua_mora at usa.net (Joshua mora acosta) Date: Wed, 30 Jun 2010 08:10:47 -0500 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? Message-ID: <386oFdNJv5776S01.1277903447@web01.cms.usa.net> I think the money part will be difficult to get (it is like a politically incorrect question). 
Nevertheless, you can split the money in two parts: purchase (which I am sure you will never get) and electric bill for kipping the system up and running while you run HPL and when you run stream. Then you could try at least to put the cost of the electric bill. electric_bill_(USD)/performance_(TFLOPs) The electric bill though will change for a given amount of kWs depending on the contract/location you establish with the electric company. So it is difficult to get that info as well. So it is perhaps better to compare systems in terms of TFLOP/kW. And factor in there what you are capable of negotiating on electric cost and purchase and support. Going back to the easy part: For instance on GPU_CPU cluster based systems, you can achieve 1.1(real_DP_TFLOP)/kW with a ratio GPU/CPU=2 With that I have factored a whole rack with HW and SW stacks for 10TF real double precission under 400K USD cost. Joshua ------ Original Message ------ Received: 12:45 AM CDT, 06/30/2010 From: Rahul Nabar To: Beowulf Mailing List Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? > The Top500 list has many useful metrics but I didn't see any $$ based > metrics there. Are there any lists that document the $-per-teraflop > (apologies to international members!) of any of the systems in the > Supercomputer / Beowulf world? Googling "dollars per teraflop" didn't > give me anything useful. > > I'm speculating, one reason could be that sites are loath to disclose > their exact $ purchase prices etc. But on the other hand for most of > the publicly owned systems this should be accessible information > anyways. I was just thinking that this might be an interesting > parameter to track. I was also curious as to when systems become > larger is there an economy of scale in the Beowulf world? i.e. for > something like Jaguar or Kraken is the $/teraflop much lower than what > it is for my tiny 100-node system. Another question could be: Is it > cheaper to assemble 100 Teraflops of capacity in the US or WU or China > etc. > > Of course, HPC is not really commoditized so a Teraflop based $ value > may not be strictly an apples-to-apples comparison but still..... > > Just wondering what statistics are available out there...... > > -- > Rahul > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From james.p.lux at jpl.nasa.gov Wed Jun 30 06:25:06 2010 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Wed, 30 Jun 2010 06:25:06 -0700 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: Message-ID: On 6/29/10 10:30 PM, "Rahul Nabar" wrote: > The Top500 list has many useful metrics but I didn't see any $$ based > metrics there. Are there any lists that document the $-per-teraflop > (apologies to international members!) of any of the systems in the > Supercomputer / Beowulf world? Googling "dollars per teraflop" didn't > give me anything useful. > > I'm speculating, one reason could be that sites are loath to disclose > their exact $ purchase prices etc. But on the other hand for most of > the publicly owned systems this should be accessible information > anyways. > It's harder than you think to come up with a "cost" Is it just the hardware purchase cost? Or do you count the assembly cost? What about infrastructure mods to hold all those racks? 
And assuming you *do* get a number, how do you compare it fairly. For instance, if you had 100 discount PCs put on the gym floor by volunteer labor vs buying an already integrated rack? Do you could integration support? Applications porting? As far as publically funded ones go.. You might wind up with a big accounting challenge to go through hundreds of invoices and contracts. In California, for instance, the CA Public Records Act says you can go and ask for pretty much any record that doesn't have personally identifiable information. But that's a long way from getting a nice "here's how much the supercomputer cost". You might have budgets and invoices from dozens of firms to go through and find stuff. > > > > I was just thinking that this might be an interesting > parameter to track. I was also curious as to when systems become > larger is there an economy of scale in the Beowulf world? i.e. for > something like Jaguar or Kraken is the $/teraflop much lower than what > it is for my tiny 100-node system. Another question could be: Is it > cheaper to assemble 100 Teraflops of capacity in the US or WU or China > etc. I think it *is* interesting, and would be useful, because managers are always having to make decisions about make vs buy, or when to buy (do we buy now and get started, or wait 1 year, when the machines are faster for the same price) From john.hearns at mclaren.com Wed Jun 30 06:38:04 2010 From: john.hearns at mclaren.com (Hearns, John) Date: Wed, 30 Jun 2010 14:38:04 +0100 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: References: Message-ID: <68A57CCFD4005646957BD2D18E60667B10F1762E@milexchmb1.mil.tagmclarengroup.com> > I'm speculating, one reason could be that sites are loath to disclose > their exact $ purchase prices etc. But on the other hand for most of > the publicly owned systems this should be accessible information > anyways. I was just thinking that this might be an interesting > parameter to track. I was also curious as to when systems become > larger is there an economy of scale in the Beowulf world? i.e. for > something like Jaguar or Kraken is the $/teraflop much lower than what > it is for my tiny 100-node system. Another question could be: Is it > cheaper to assemble 100 Teraflops of capacity in the US or WU or China I don't think you'll find that information anywhere readily. Also consider the difference between peak HPL flops rating and the useful work you will get out of a system. Sticking my neck out slightly here, systems with lots of GPUs will score highly on the $$/flop ratings - but do you get that amount of work under real-world loads? John Hearns The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From landman at scalableinformatics.com Wed Jun 30 07:13:34 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 30 Jun 2010 10:13:34 -0400 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? 
In-Reply-To: <4C2B3C5D.9030405@ias.edu> References: <4C2B3C5D.9030405@ias.edu> Message-ID: <4C2B510E.5050406@scalableinformatics.com> Prentice Bisbal wrote: > These large supercomputers at national labs or large universities are > often provided by the vendor for little or no profit (maybe even a loss) s/maybe/usually/ > in exchange for prestige/advertising opportunities or R&D opportunities. Hah. Allow me to restate this. Hah. It is extraordinarily rare that an entity will give you permission to use their name in any advertising. The most you can hope for is, generally, a press release. Prestige does not translate into (profitable) revenue. > They make up this loss by selling to money-making corporations for a > much bigger margin. Hmmm .... I think there may be a fundamental disconnect between the assumptions of folks in academia and the reality of this particular market. I am not bashing on Prentis. I would like to point out that the "much larger margins" in a cutthroat business such as clusters are ... er ... not much larger. I could bore people with anecdotes, but the fundamental take home message is, if you believe this (much larger margins bit), you are mistaken. The market for commercial HPC is under intense downward price pressure. More so than ever before. Companies, when they have money to spend, want to spend less of it, and get the same or better systems/performance than in the past. What we are seeing is a fundamental change of business model, over to one that keeps upfront capital costs as low as possible, and pushes things to expense columns. This is in part what is driving the significant interest in accelerators /APUs, and in remote cycle rental. The costs associated with powering and cooling accelerators are minimal as compared to small/mid sized clusters. The costs associated with very large clusters can be made into pure on-demand expenses with remote cycle rental. > For example, I would not be surprised if IBM practically gave away > RoadRunner to Los Alamos in exchange for the computing expertise at Los > Alamos to help develop such an architecture and then be able to say that > IBM builds the world's fastest computers (and that your company can have > one just like it, for a price). Oh, and the users at Los Alamos probably > provide lots of feedback to IBM which helps them build better systems in > the future. (Don't shoot me if I'm wrong. I'm just theorizing here) Don't sell the folks at TJ Watson/IBM Research short. They are a very bright group. The co-R&D elements are a way IBM can dominate the HPC bits at the high end, and provide something that looks like an in-kind contribution type of model so that LANL and others can go to their granting agencies and get either more money, or fulfill specific contract points. IBM is a business, and in most cases, won't generally have a particular business unit make a loss for "prestige" points. That doesn't make the board/shareholders happy. > I used to work at the Princeton Plasma Physics Lab (www.pppl.gov), a > Dept of Energy National Lab, and I can tell you many of the the systems > sold to PPPL when I worked there was sold under a NDA, preventing anyone > from discussing the price. Yes, the budgets are public information, but > that tells you the computer hardware budget for a year, not how much was > spent on each computer. Well part of that is to prevent shopping the quote. We see this *all* the time. Someone asks for a quote, you provide it. 
They then go and take your quote, elide specific company information, and then send it around asking others to beat it. An NDA gives you a mechanism to stop this with a specific legal enjoinder from discussing terms. Copyrighting quotes and specifically restricting redistribution of content and information contained is another method. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From Bill.Rankin at sas.com Wed Jun 30 08:11:54 2010 From: Bill.Rankin at sas.com (Bill Rankin) Date: Wed, 30 Jun 2010 15:11:54 +0000 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <386oFdNJv5776S01.1277903447@web01.cms.usa.net> References: <386oFdNJv5776S01.1277903447@web01.cms.usa.net> Message-ID: <76097BB0C025054786EFAB631C4A2E3C09331921@MERCMBX04R.na.SAS.com> > I think the money part will be difficult to get (it is like a > politically > incorrect question). Joe addressed this pretty well. For the large systems, it's almost always under NDA. > Nevertheless, you can split the money in two parts: purchase (which I > am sure > you will never get) and electric bill for kipping the system up and > running while you run HPL and when you run stream. Once you start looking at the power bills (for both the system as well as all the associated infrastructure, like cooling) then you pretty much need to start looking at approximations for the total cost of ownership (TCO). Depending on the organization, many of these costs are well hidden. We went through this exercise when I was at Duke (with due credit to Rob Brown who did a lot of the heavy lifting). Some of the things you have to consider are: - Power (for both machines and cooling). Given commercial rates at the time (~2004) this worked out to about $1/Watt/year. That makes for $300k/year for a 1000 node cluster at 300W/node. - Don't forget depreciation on all that support equipment. While your cluster may have a useful lifetime of around 3-5 years, all those air handlers, power conditioners and UPS's have lifetimes too. Figure 10-15 years (if you can reuse them) and factor that amortized cost into your bottom line. - Staff salaries, both for administration and operations/monitoring. Loaded salary for a decent cluster admin may be $100k/year or more. Bottom line is that you could spend 30%-50% (or more) additional dollars beyond the cost of the hardware just to cover the basic support needs for the facility. -b From prentice at ias.edu Wed Jun 30 08:37:30 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Wed, 30 Jun 2010 11:37:30 -0400 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C2B510E.5050406@scalableinformatics.com> References: <4C2B3C5D.9030405@ias.edu> <4C2B510E.5050406@scalableinformatics.com> Message-ID: <4C2B64BA.3000704@ias.edu> Joe Landman wrote: > Prentice Bisbal wrote: > >> These large supercomputers at national labs or large universities are >> often provided by the vendor for little or no profit (maybe even a loss) > > s/maybe/usually/ > >> in exchange for prestige/advertising opportunities or R&D opportunities. > > Hah. Allow me to restate this. > > Hah. > > It is extraordinarily rare that an entity will give you permission to > use their name in any advertising. The most you can hope for is, > generally, a press release. 
And what would you call all the press that IBM got for Roadrunner and Cray got for Jaguar when those systems were at the top of the Top500? Also, at SC09, the ORNL booth was showing off that they now had the top system, and weren't hiding the fact that it was built by Cray. I would call that a lot more than just a "press release". All that industry media coverage is a lot better advertising than any single paid advertisement. > > Prestige does not translate into (profitable) revenue. > I think you meant to say "Prestige does not always translate into profitable revenue. Rolls Royce, Bentley, and Lamborghini are just a few examples of prestige not translating into profits. Lexus, Acura, and Infiniti are all examples of prestige translating into huge profits. Lexus, Acura, and Infiniti autos aren't radically different from the Toyotas, Hondas, and Nissans they are based on, but the cost more, mostly because of the prestige of the upmarket name. >> They make up this loss by selling to money-making corporations for a >> much bigger margin. > > Hmmm .... > > I think there may be a fundamental disconnect between the assumptions of > folks in academia and the reality of this particular market. I am not > bashing on Prentis. I would like to point out that the "much larger > margins" in a cutthroat business such as clusters are ... er ... not > much larger. > > I could bore people with anecdotes, but the fundamental take home > message is, if you believe this (much larger margins bit), you are > mistaken. Any profit is "much larger" than a loss. > >> For example, I would not be surprised if IBM practically gave away >> RoadRunner to Los Alamos in exchange for the computing expertise at Los >> Alamos to help develop such an architecture and then be able to say that >> IBM builds the world's fastest computers (and that your company can have >> one just like it, for a price). Oh, and the users at Los Alamos probably >> provide lots of feedback to IBM which helps them build better systems in >> the future. (Don't shoot me if I'm wrong. I'm just theorizing here) > > Don't sell the folks at TJ Watson/IBM Research short. They are a very > bright group. The co-R&D elements are a way IBM can dominate the HPC > bits at the high end, and provide something that looks like an in-kind > contribution type of model so that LANL and others can go to their > granting agencies and get either more money, or fulfill specific > contract points. I wasn't selling the brains IBM short. We all know that two heads are better then one when tackling a problem. I was saying that if IBM geniuses = good, and LANL geniuses = good, then IBM geniuses + LANL geniuses = better. > > IBM is a business, and in most cases, won't generally have a particular > business unit make a loss for "prestige" points. That doesn't make the > board/shareholders happy. > How much profit did IBM make off of Deep Blue when it beat Gary Kasparov? None that I know of. However, it did provide IBM R&D opportunities and when it finally beat Gary Kasparov, plenty of free advertising for IBM through news coverage, and ...prestige. -- Prentice From landman at scalableinformatics.com Wed Jun 30 09:10:17 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 30 Jun 2010 12:10:17 -0400 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? 
In-Reply-To: <4C2B64BA.3000704@ias.edu> References: <4C2B3C5D.9030405@ias.edu> <4C2B510E.5050406@scalableinformatics.com> <4C2B64BA.3000704@ias.edu> Message-ID: <4C2B6C69.60309@scalableinformatics.com> Prentice Bisbal wrote: >> It is extraordinarily rare that an entity will give you permission to >> use their name in any advertising. The most you can hope for is, >> generally, a press release. > > And what would you call all the press that IBM got for Roadrunner and > Cray got for Jaguar when those systems were at the top of the Top500? "extraordinarily rare" > Also, at SC09, the ORNL booth was showing off that they now had the top > system, and weren't hiding the fact that it was built by Cray. I would > call that a lot more than just a "press release". All that industry Again, academic/national labs gain prestige points by doing this. Prestige rarely, if ever, turns into revenue, and even less frequently, into profit. > media coverage is a lot better advertising than any single paid > advertisement. Hmm .... see above. Did this media coverage inspire you to purchase a Blue Gene? Or an XT6? > >> Prestige does not translate into (profitable) revenue. >> > > I think you meant to say "Prestige does not always translate into > profitable revenue. Thank you ... I agree, I implied the "extraordinarily rare" aspect of this. Basically prestige plus $4 + change will get you a Grande sized mocha triple shot from Starbucks. Not much more than that. This isn't a jaundiced view (ok, I hope not!) of the situation ... for-profit companies can't afford to be paid in "prestige". Put it another way. Supposed the IAS's granting agencies decided that they would provide grants that didn't cover complete costs, but provided "prestige" for getting the grant in the first place. Even though IAS isn't a for-profit institution per se, it still has bills to pay, and people to employ. This sort of scenario is the case in for-profit HPC in academia. The prestige earned doesn't pay the bills. You can't take the prestige to the bank and earn interest on it. Its an intangible ... good will ... is how I think it is accounted for. But is it valuable? Ask IBM how many more BG's the've sold as a result of that "marketing campaign". As I said, I think there is a fundamental disconnect between what people would like to believe, and the (rather harsh) realities of the market. This is not a bash. This is an observation. > Rolls Royce, Bentley, and Lamborghini are just a few examples of > prestige not translating into profits. Bentley had been up for sale, RR is now owned by Tata, as is Jaguar. Prestige does not pay the bills. And if I am wrong on this, please, educate me ... I'd like to figure out how to do this. > > Lexus, Acura, and Infiniti are all examples of prestige translating into > huge profits. Lexus, Acura, and Infiniti autos aren't radically Er ... no. They are all examples of really good marketing, and not trying to compete on price. > different from the Toyotas, Hondas, and Nissans they are based on, but > the cost more, mostly because of the prestige of the upmarket name. Thats marketing, not prestige. When you introduce a new brand, it doesn't have a history, and hence, no prestige. Marketing is what enables you to attract customers. Some customers will be put off by price. You either lower your pricing to keep them, or you ignore them as a market. The folks you indicated, all ignore the price sensitive elements of the market. Which makes them vulnerable to attack by the Hyundai's and others of the world. 
> > >>> They make up this loss by selling to money-making corporations for a >>> much bigger margin. >> Hmmm .... >> >> I think there may be a fundamental disconnect between the assumptions of >> folks in academia and the reality of this particular market. I am not >> bashing on Prentis. I would like to point out that the "much larger >> margins" in a cutthroat business such as clusters are ... er ... not >> much larger. >> >> I could bore people with anecdotes, but the fundamental take home >> message is, if you believe this (much larger margins bit), you are >> mistaken. > > Any profit is "much larger" than a loss. So is a $1 USD profit on a $1M USD sale a reasonable profit? And if the loss is $100 USD on the $1M USD sale, is the $1 >> $100 ? No. > >>> For example, I would not be surprised if IBM practically gave away >>> RoadRunner to Los Alamos in exchange for the computing expertise at Los >>> Alamos to help develop such an architecture and then be able to say that >>> IBM builds the world's fastest computers (and that your company can have >>> one just like it, for a price). Oh, and the users at Los Alamos probably >>> provide lots of feedback to IBM which helps them build better systems in >>> the future. (Don't shoot me if I'm wrong. I'm just theorizing here) >> Don't sell the folks at TJ Watson/IBM Research short. They are a very >> bright group. The co-R&D elements are a way IBM can dominate the HPC >> bits at the high end, and provide something that looks like an in-kind >> contribution type of model so that LANL and others can go to their >> granting agencies and get either more money, or fulfill specific >> contract points. > > I wasn't selling the brains IBM short. We all know that two heads are > better then one when tackling a problem. I was saying that if IBM > geniuses = good, and LANL geniuses = good, then IBM geniuses + LANL > geniuses = better. Ahh .... ok. IBM has some really good folks at their research locations. > >> IBM is a business, and in most cases, won't generally have a particular >> business unit make a loss for "prestige" points. That doesn't make the >> board/shareholders happy. >> > > How much profit did IBM make off of Deep Blue when it beat Gary > Kasparov? None that I know of. However, it did provide IBM R&D > opportunities and when it finally beat Gary Kasparov, plenty of free > advertising for IBM through news coverage, and ...prestige. ... which, as you note, translated to no profit. The press however, provided them effectively free marketing. Publish this story, and we don't have to pay for it. ... which is worth ... what? How many BG's did IBM sell as a result of the chess match? How many people made a decision, influenced in part, by that PR and free marketing? Thats the point. Like it or not, prestige doesn't placate shareholders, board members, or wall street. They want to see profits, pure and simple. > > -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From rpnabar at gmail.com Wed Jun 30 09:17:28 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 30 Jun 2010 11:17:28 -0500 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? 
In-Reply-To: <68A57CCFD4005646957BD2D18E60667B10F1762E@milexchmb1.mil.tagmclarengroup.com> References: <68A57CCFD4005646957BD2D18E60667B10F1762E@milexchmb1.mil.tagmclarengroup.com> Message-ID: On Wed, Jun 30, 2010 at 8:38 AM, Hearns, John wrote: > Sticking my neck out slightly here, systems with lots of GPUs will score > highly on > the $/flop ratings - but do you get that amount of work under > real-world loads? Sticking my neck out even more, but maybe that problem can be solved by using the actual Teraflops as opposed to peak-teraflops? I'm curious, how good are these GPU based systems when solving something like the Linpack or the SPEC benchmarks? Of course, I'm not saying either of these benchmarks are representative of a "real-world" load. But perhaps they are closer than a peak-teraflops metric. -- Rahul From joshua_mora at usa.net Wed Jun 30 09:42:39 2010 From: joshua_mora at usa.net (Joshua mora acosta) Date: Wed, 30 Jun 2010 11:42:39 -0500 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? Message-ID: <429oFdqpn6192S04.1277916159@web04.cms.usa.net> Now it comes the funny part: Out of that electric bill of say 300k for 1000 nodes at a reasonable efficiency of 30% for a real well tuned application (not Linpack which is >80%) it basically means 90K USD is worth the work. The other 210K US is electric bill wasted waiting for data to hit the caches. Joshua ------ Original Message ------ Received: 10:12 AM CDT, 06/30/2010 From: Bill Rankin To: Joshua mora acosta , Rahul Nabar , Beowulf Mailing List Subject: RE: [Beowulf] dollars-per-teraflop : any lists like the Top500? > > I think the money part will be difficult to get (it is like a > > politically > > incorrect question). > > Joe addressed this pretty well. For the large systems, it's almost always under NDA. > > > Nevertheless, you can split the money in two parts: purchase (which I > > am sure > > you will never get) and electric bill for kipping the system up and > > running while you run HPL and when you run stream. > > Once you start looking at the power bills (for both the system as well as all the associated infrastructure, like cooling) then you pretty much need to start looking at approximations for the total cost of ownership (TCO). Depending on the organization, many of these costs are well hidden. We went through this exercise when I was at Duke (with due credit to Rob Brown who did a lot of the heavy lifting). Some of the things you have to consider are: > > - Power (for both machines and cooling). Given commercial rates at the time (~2004) this worked out to about $1/Watt/year. That makes for $300k/year for a 1000 node cluster at 300W/node. > > - Don't forget depreciation on all that support equipment. While your cluster may have a useful lifetime of around 3-5 years, all those air handlers, power conditioners and UPS's have lifetimes too. Figure 10-15 years (if you can reuse them) and factor that amortized cost into your bottom line. > > - Staff salaries, both for administration and operations/monitoring. Loaded salary for a decent cluster admin may be $100k/year or more. > > Bottom line is that you could spend 30%-50% (or more) additional dollars beyond the cost of the hardware just to cover the basic support needs for the facility. > > -b > > From james.p.lux at jpl.nasa.gov Wed Jun 30 09:49:24 2010 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Wed, 30 Jun 2010 09:49:24 -0700 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? 
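Spelling that split out with the figures already used in this thread (Bill's ~$1/W/year at ~300 W/node, Joshua's ~30% sustained efficiency on real applications versus >80% on HPL), purely as a back-of-envelope illustration:

  power + cooling : 1000 nodes x 300 W x $1/W/year = $300k/year
  'useful' share  : 0.30 x $300k                   = $90k/year
  stalled share   : 0.70 x $300k                   = $210k/year

so dividing the same bill by sustained rather than peak TFLOPs makes the operating cost per delivered TFLOP roughly 3x (1/0.3) higher.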
In-Reply-To: <4C2B510E.5050406@scalableinformatics.com> References: <4C2B3C5D.9030405@ias.edu> <4C2B510E.5050406@scalableinformatics.com> Message-ID: Comments interspersed below... Joe's comments are generally right on, and I can provide some insight into how governments buy stuff (it's pretty strictly regulated.. far more so than in private industry, but some of the processes seem arcane and bewildering at first glance) Jim > -----Original Message----- > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Joe Landman > > Prentice Bisbal wrote: > > > These large supercomputers at national labs or large universities are > > often provided by the vendor for little or no profit (maybe even a loss) > > s/maybe/usually/ > > > in exchange for prestige/advertising opportunities or R&D opportunities. > > Hah. Allow me to restate this. > > Hah. > > It is extraordinarily rare that an entity will give you permission to > use their name in any advertising. The most you can hope for is, > generally, a press release. I agree. Here at JPL we have pretty stringent rules under which we can do things in terms of using the name of JPL or NASA. Granted, nothing stops you advertising your product "as sold to NASA", but under no circumstances could we provide any sort of endorsement or recommendation. We can co-author a paper or report with the vendor which reports information from some activity. We can say what we did, and make generalized fact based recommendations. > > Prestige does not translate into (profitable) revenue. > > > They make up this loss by selling to money-making corporations for a > > much bigger margin. > > Hmmm .... > > I think there may be a fundamental disconnect between the assumptions of > folks in academia and the reality of this particular market. I am not > bashing on Prentis. I would like to point out that the "much larger > margins" in a cutthroat business such as clusters are ... er ... not > much larger. I agree here, too. My wife works in commercial IT, and I'd say that they tend to beat the vendors down more than we do here in government funded work. She (and her bosses) have to report against monthly, quarterly, and annual targets for everything. Government work tends to be funded on an annual cycle (October 1st is the FY start date.. funding gets set based on proposals in March/April, and cast into concrete around now for the next FY). Where government work is concerned, the problems usually aren't profit margin (we're happy to give a reasonable rate of return, because screwing the clamps down onto the vendor's last penny usually doesn't work out well.. they'll put their top people on the jobs that have decent returns, especially if your job gets into difficulties). It's the myriad other weird and not so weird conditions (Drug Free Workplace Act, Buy American Act, Foreign Corrupt Practices Act, etc.) resulting from the dual role of government procurement: get something useful done and create social policy. That, and the absolute paranoia that the taxpayer might be getting the short end of the stick results in a substantially larger paperwork burden to prove they aren't. All those folks standing up decrying "waste, fraud, and abuse"... In a commercial entity, one can do a cost benefit trade on, say, inventory losses vs time/effort to keep track of things. Not in government, when someone will be sure to stand up and say "Agency X lost or misplaced 3 laptops out of the 123,000 they have, and this must stop now!" 
Government work also often pays slow, but reliably. But your bank may not understand, so getting operating capital can be a challenge. > What we are seeing is a fundamental change of business model, over to > one that keeps upfront capital costs as low as possible, and pushes > things to expense columns. Yes.. it makes *this week/month/quarter's* numbers look better, and also gets you out of having to seek capital (which has been hard recently in these odd-times for credit). > > > For example, I would not be surprised if IBM practically gave away > > RoadRunner to Los Alamos in exchange for the computing expertise at Los > > Alamos to help develop such an architecture and then be able to say that > > IBM builds the world's fastest computers (and that your company can have > > one just like it, for a price). Oh, and the users at Los Alamos probably > > provide lots of feedback to IBM which helps them build better systems in > > the future. (Don't shoot me if I'm wrong. I'm just theorizing here) > > Don't sell the folks at TJ Watson/IBM Research short. They are a very > bright group. The co-R&D elements are a way IBM can dominate the HPC > bits at the high end, and provide something that looks like an in-kind > contribution type of model so that LANL and others can go to their > granting agencies and get either more money, or fulfill specific > contract points. This sort of thing is often done under a "Cooperative Research and Development Agreement" or CRADA. This is basically a contract between the govt and the vendor which lays out who is doing what, what they're bringing to the table, and where the intellectual property rights will wind up. For example, I'm working on a space mission now where several vendors have provided equipment under CRADAs. I don't know anything about what's in the agreement, but in general, it's something along the lines of "we give you a ride into space and you get to fly your box and test your new technology" International MoUs for science instruments also work this way. For NASA this is all covered under the "Space Act" > > IBM is a business, and in most cases, won't generally have a particular > business unit make a loss for "prestige" points. That doesn't make the > board/shareholders happy. You got that right.. You'd have to put a number on the prestige and trade it against someone's advertising budget. > > > I used to work at the Princeton Plasma Physics Lab (www.pppl.gov), a > > Dept of Energy National Lab, and I can tell you many of the the systems > > sold to PPPL when I worked there was sold under a NDA, preventing anyone > > from discussing the price. Yes, the budgets are public information, but > > that tells you the computer hardware budget for a year, not how much was > > spent on each computer. > > Well part of that is to prevent shopping the quote. We see this *all* > the time. Someone asks for a quote, you provide it. They then go and > take your quote, elide specific company information, and then send it > around asking others to beat it. Exactly.. we do this with "source selection" all the time. The proposals are proprietary and confidential, the reviewers on the Source Evaluation Board all sign NDAs and go work in a special room where all the materials are kept. It's a very, very big deal (even for relatively small procurements). 
Once the selection is made, then we shred all the materials, and go to negotiate the actual contract with the vendor (which can't change substantially from what they proposed, because otherwise the losing vendors can legitimately complain). During the negotiation process (especially if it is a cost plus fixed fee) the vendor will need to provide a lot of financial details to allow our people to determine if the price is "fair" and that the vendor is giving the government the "lowest price" (if you want to stay out of trouble, do not sell us something for $1000 and then sell it to someone else for $900). That financial detail is generally proprietary (e.g. as a vendor you don't want us telling everyone how much your people are paid and how much you pay in fringe benefits) and wouldn't be disclosed, but the total contract value, and a fair amount of other information, is disclosed. Jim From prentice at ias.edu Wed Jun 30 12:43:27 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Wed, 30 Jun 2010 15:43:27 -0400 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C2B6C69.60309@scalableinformatics.com> References: <4C2B3C5D.9030405@ias.edu> <4C2B510E.5050406@scalableinformatics.com> <4C2B64BA.3000704@ias.edu> <4C2B6C69.60309@scalableinformatics.com> Message-ID: <4C2B9E5F.2010602@ias.edu> I'd like to apologize to other beowulfers for going way off-topic. This will be my last post on this topic. Joe Landman wrote: > Prentice Bisbal wrote: > >>> It is extraordinarily rare that an entity will give you permission to >>> use their name in any advertising. The most you can hope for is, >>> generally, a press release. >> >> And what would you call all the press that IBM got for Roadrunner and >> Cray got for Jaguar when those systems were at the top of the Top500? > > "extraordinarily rare" > >> Also, at SC09, the ORNL booth was showing off that they now had the top >> system, and weren't hiding the fact that it was built by Cray. I would >> call that a lot more than just a "press release". All that industry > > Again, academic/national labs gain prestige points by doing this. > Prestige rarely, if ever, turns into revenue, and even less frequently, > into profit. > >> media coverage is a lot better advertising than any single paid >> advertisement. > > Hmm .... see above. Did this media coverage inspire you to purchase a > Blue Gene? Or an XT6? > No because Roadrunner was not a Blue Gene system ;). We need to look beyond Roadrunner selling more roadrunner-like systems and Jaguar selling more Jaguars. The success of Roadrunner and Deep Blue probably didn't sell more Roadrunners and Deep Blues, but I'm sure they had an effect on IBMs stock price, and help sell lower-end IBM systems. IBM dominates the Top 500 right now. I'm sure their success with Roadrunner, Deep Blue, and Blue Gene have something to do with that. If not direct technology transfer, I bet Bob at Acme thinks to himself "IBM has done a lot of great things in supercomputing. They're definitely the experts. I think we should hire them to build and integrate a new 128-node cluster for our comp. chem group. >> >>> Prestige does not translate into (profitable) revenue. >>> >> >> I think you meant to say "Prestige does not always translate into >> profitable revenue. > > Thank you ... I agree, I implied the "extraordinarily rare" aspect of this. > > Basically prestige plus $4 + change will get you a Grande sized mocha > triple shot from Starbucks. Not much more than that. 
This isn't a > jaundiced view (ok, I hope not!) of the situation ... for-profit > companies can't afford to be paid in "prestige". > > Put it another way. Supposed the IAS's granting agencies decided that > they would provide grants that didn't cover complete costs, but provided > "prestige" for getting the grant in the first place. Even though IAS > isn't a for-profit institution per se, it still has bills to pay, and > people to employ. This sort of scenario is the case in for-profit HPC > in academia. The prestige earned doesn't pay the bills. You can't take > the prestige to the bank and earn interest on it. Its an intangible ... > good will ... is how I think it is accounted for. > > But is it valuable? Ask IBM how many more BG's the've sold as a result > of that "marketing campaign". As I said, I think there is a fundamental > disconnect between what people would like to believe, and the (rather > harsh) realities of the market. This is not a bash. This is an > observation. > >> Rolls Royce, Bentley, and Lamborghini are just a few examples of >> prestige not translating into profits. > > Bentley had been up for sale, RR is now owned by Tata, as is Jaguar. > > Prestige does not pay the bills. And if I am wrong on this, please, > educate me ... I'd like to figure out how to do this. Ask Rolex or Patek Philippe. I'm sure the only reason people drop large $$ on their watches is for the prestige of the name. (By the way - I met several Patek Philippe workers in NYC once. To them Rolex might as well be Casio - they get insulted if you compare their watches to Rolex) > >> >> Lexus, Acura, and Infiniti are all examples of prestige translating into >> huge profits. Lexus, Acura, and Infiniti autos aren't radically > > Er ... no. They are all examples of really good marketing, and not > trying to compete on price. > >> different from the Toyotas, Hondas, and Nissans they are based on, but >> the cost more, mostly because of the prestige of the upmarket name. > > Thats marketing, not prestige. When you introduce a new brand, it > doesn't have a history, and hence, no prestige. Marketing and prestige go hand in hand. Toyota's aggressive marketing of Lexus as a viable alternative to Mercedes gave it prestige. > > Marketing is what enables you to attract customers. Some customers will > be put off by price. You either lower your pricing to keep them, or you > ignore them as a market. The folks you indicated, all ignore the price > sensitive elements of the market. Which makes them vulnerable to attack > by the Hyundai's and others of the world. > >> >> >>>> They make up this loss by selling to money-making corporations for a >>>> much bigger margin. >>> Hmmm .... >>> >>> I think there may be a fundamental disconnect between the assumptions of >>> folks in academia and the reality of this particular market. I am not >>> bashing on Prentis. I would like to point out that the "much larger >>> margins" in a cutthroat business such as clusters are ... er ... not >>> much larger. >>> >>> I could bore people with anecdotes, but the fundamental take home >>> message is, if you believe this (much larger margins bit), you are >>> mistaken. >> >> Any profit is "much larger" than a loss. > > So is a $1 USD profit on a $1M USD sale a reasonable profit? And if the > loss is $100 USD on the $1M USD sale, is the $1 >> $100 ? > > No. 
> >> >>>> For example, I would not be surprised if IBM practically gave away >>>> RoadRunner to Los Alamos in exchange for the computing expertise at Los >>>> Alamos to help develop such an architecture and then be able to say >>>> that >>>> IBM builds the world's fastest computers (and that your company can >>>> have >>>> one just like it, for a price). Oh, and the users at Los Alamos >>>> probably >>>> provide lots of feedback to IBM which helps them build better >>>> systems in >>>> the future. (Don't shoot me if I'm wrong. I'm just theorizing here) >>> Don't sell the folks at TJ Watson/IBM Research short. They are a very >>> bright group. The co-R&D elements are a way IBM can dominate the HPC >>> bits at the high end, and provide something that looks like an in-kind >>> contribution type of model so that LANL and others can go to their >>> granting agencies and get either more money, or fulfill specific >>> contract points. >> >> I wasn't selling the brains IBM short. We all know that two heads are >> better then one when tackling a problem. I was saying that if IBM >> geniuses = good, and LANL geniuses = good, then IBM geniuses + LANL >> geniuses = better. > > Ahh .... ok. IBM has some really good folks at their research locations. > >> >>> IBM is a business, and in most cases, won't generally have a particular >>> business unit make a loss for "prestige" points. That doesn't make the >>> board/shareholders happy. >>> >> >> How much profit did IBM make off of Deep Blue when it beat Gary >> Kasparov? None that I know of. However, it did provide IBM R&D >> opportunities and when it finally beat Gary Kasparov, plenty of free >> advertising for IBM through news coverage, and ...prestige. > > ... which, as you note, translated to no profit. The press however, > provided them effectively free marketing. Publish this story, and we > don't have to pay for it. > > ... which is worth ... what? > > How many BG's did IBM sell as a result of the chess match? How many > people made a decision, influenced in part, by that PR and free marketing? > > Thats the point. Like it or not, prestige doesn't placate shareholders, > board members, or wall street. They want to see profits, pure and simple. See my first inline comment. Let's end this discussion, before I get the same reputation as that crazy dutch guy. -- Prentice From rpnabar at gmail.com Wed Jun 30 13:21:08 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 30 Jun 2010 15:21:08 -0500 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <20100630055026.GF28068@bx9.net> References: <20100630055026.GF28068@bx9.net> Message-ID: On Wed, Jun 30, 2010 at 12:50 AM, Greg Lindahl wrote: > Other communities with $-based metrics haven't had much success with > them. > > In HPC, many contracts are multi-year, multi-delivery, or, they > include significant extra stuff beyond the iron. Thanks Beowulfers for some interesting comments and discussion there. I guess, the conclusion (as I see it ) is : At the present time there is no good (read "easy", "direct" , "quick" etc.) way of benchmarking the $ costs of my system against others in the ecosystem. For what it's worth my own estimate of our circa. 2010 cluster is $35k/Teraflop (peak). -- Rahul From hahn at mcmaster.ca Wed Jun 30 13:45:09 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed, 30 Jun 2010 16:45:09 -0400 (EDT) Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? 
In-Reply-To: References: <20100630055026.GF28068@bx9.net> Message-ID: > For what it's worth my own estimate of our circa. 2010 cluster is > $35k/Teraflop (peak). my organization is trying to finalize a cluster that's about $CAD 30k/TF. From rchang.lists at gmail.com Wed Jun 30 18:54:54 2010 From: rchang.lists at gmail.com (Richard Chang) Date: Thu, 01 Jul 2010 07:24:54 +0530 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C2B9E5F.2010602@ias.edu> References: <4C2B3C5D.9030405@ias.edu> <4C2B510E.5050406@scalableinformatics.com> <4C2B64BA.3000704@ias.edu> <4C2B6C69.60309@scalableinformatics.com> <4C2B9E5F.2010602@ias.edu> Message-ID: <4C2BF56E.7090303@gmail.com> An HTML attachment was scrubbed... URL: From landman at scalableinformatics.com Wed Jun 30 19:41:01 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 30 Jun 2010 22:41:01 -0400 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C2BF56E.7090303@gmail.com> References: <4C2B3C5D.9030405@ias.edu> <4C2B510E.5050406@scalableinformatics.com> <4C2B64BA.3000704@ias.edu> <4C2B6C69.60309@scalableinformatics.com> <4C2B9E5F.2010602@ias.edu> <4C2BF56E.7090303@gmail.com> Message-ID: <4C2C003D.8000606@scalableinformatics.com> Richard Chang wrote: > You are right when you said that these big companies sell their stuff at > a huge discount, at least initially, is what I know. > > Here in India, where I live and work, IBM had, 3-4 yrs back, sold a BG/L > for way less than anything, virtually at the price of a normal cluster. > This was done to cut out the competition and to boast about their system > being sold, i.e., the first ever BG/L being sold in India. The > competition, as expected, was very livid that IBM could give it away at > such throwaway prices. ... we (in the business) call that "buying the business". You literally pay your customer to take your system. It doesn't take many of these to get senior execs asking where the profit is. That is, if you look at this as an investment, what is the return on this investment? My argument is that the return is nearly, if not identically, zero. My rationale for this argument comes from the fact that once a customer learns that someone else got a great deal, they also demand a similar deal. This is the segue to the NDA bit earlier. So, unless you hide the details of your sale, your margins will be impacted on nearly every sale. Does prestige translate into increased revenue? Let's ask on this list (self-selecting, probably not statistically valid, but may give a rough picture): Question for the list members who have bought (large-ish) clusters/HPC systems: Was your selection influenced by the heroic class systems sales? Did you purposefully buy from the same vendor because of this, or was this a significant contributing factor in your decision process? Feel free to answer offline and anonymously if you'd like (I'll post the question on http://scalability.org as well ... not a commercial site, no adverts there, and we already have quite a bit of daily traffic ... no astroturfing going on here). > Did IBM make a profit? I doubt it. It's another matter that this prestige > didn't give them enough mileage. It didn't start selling BG/Ls like hot > cakes. It certainly gave them a boasting ground. That's my point. Prestige doesn't normally translate into sales. Prestige gives you something to talk about, over that $4 USD cup of coffee from Starbucks.
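As a rough illustration of where dollars-per-teraflop figures like the ones quoted in this thread come from, here is a minimal back-of-the-envelope sketch; the node count, per-node specification, and price below are hypothetical placeholders, not anyone's actual system, and it covers peak (not sustained) performance only:

  #!/bin/bash
  # Hypothetical circa-2010 cluster; every number here is a placeholder.
  NODES=32
  SOCKETS=2           # sockets per node
  CORES=4             # cores per socket
  GHZ=2.66            # clock frequency in GHz
  FLOPS_PER_CLOCK=4   # double-precision FLOPs per core per cycle (SSE-era assumption)
  PRICE=120000        # total delivered price in dollars, interconnect included

  # peak TF = nodes * sockets * cores * GHz * FLOPs/cycle / 1000
  awk -v n="$NODES" -v s="$SOCKETS" -v c="$CORES" -v g="$GHZ" \
      -v f="$FLOPS_PER_CLOCK" -v p="$PRICE" 'BEGIN {
          tf = n * s * c * g * f / 1000
          printf "peak: %.2f TF, cost: $%.0f per peak TF\n", tf, p / tf
      }'

For the placeholder numbers above this works out to roughly 2.7 TF peak and about $44k per peak teraflop; swapping in sustained (Linpack) numbers, multi-year service, and facility costs moves the figure around considerably, which is much of why a clean $-per-TF league table is hard to build.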
Put another way, who won the various races over the wilderness isn't likely to influence many SUV buyers as to whether they should pick a particular brand. Prestige is a talking point ... something like "hey, did you know ..." > The subsequent quotes were very high that they couldn't win the > contracts. I was once told by a reseller that IBM's higher-ups decided > against further discounts(they will need to start making money). :-) Yeah ... this happens. If you start buying the business (won't mention any vendor names here), pretty soon you reach a point where a senior VP or the CEO looks at the profit and loss for each division/group, and notices one little one ... these HPC folks ... are bleeding capital. Unless that bleeding (also called 'investment' above in a somewhat semi-euphemistic manner) can be turned around (also called 'return on investment' above in a somewhat semi-euphemistic manner), and they can start showing a profit, that exec is going to think twice about continuing that line of business. > So, the point here is that though prestige is ! = profit, it surely > helps their reputation. Absolutely. If Prentis and his team at IAS bought a huge storage cluster at a very low margin from us, it wouldn't likely translate to a sale somewhere else, even if we could use IAS's name (we couldn't). The prestige is a badge of honor, not a sales tool. > Richard. Joe -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From landman at scalableinformatics.com Wed Jun 30 20:11:38 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 30 Jun 2010 23:11:38 -0400 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C2B9E5F.2010602@ias.edu> References: <4C2B3C5D.9030405@ias.edu> <4C2B510E.5050406@scalableinformatics.com> <4C2B64BA.3000704@ias.edu> <4C2B6C69.60309@scalableinformatics.com> <4C2B9E5F.2010602@ias.edu> Message-ID: <4C2C076A.5010703@scalableinformatics.com> Prentice Bisbal wrote: > I'd like to apologize to other beowulfers for going way off-topic. This > will be my last post on this topic. Actually given the light volume on the list, its not too bad ... and it is on topic in the business sense. At the end of the day, the fundamental question we are debating is, does the "prestige" of working with a top university/national lab have any real tangible value that you can ascribe to the bottom line, does it actually impact sales. I posit that the answer to this is a resounding "no". You obviously disagree. This is the business side of HPC. Its definitely relevant to beowulfery, which seeks to minimize cost per cycle. [...] >> Hmm .... see above. Did this media coverage inspire you to purchase a >> Blue Gene? Or an XT6? >> > > No because Roadrunner was not a Blue Gene system ;). Irrelevant to the argument. Did, the prestige of a particular system at the very high end induce you to buy a similar one? I don't think you answered affirmatively on this. > We need to look beyond Roadrunner selling more roadrunner-like systems > and Jaguar selling more Jaguars. Well, no. This is what was implied, that the prestige has follow on economic value. I posit it doesn't. 
> The success of Roadrunner and Deep Blue probably didn't sell more s/probably// > Roadrunners and Deep Blues, but I'm sure they had an effect on IBMs > stock price, and help sell lower-end IBM systems. IBM dominates the Top Well, here is where it gets murky. Can you, with any specificity, indicate what the impact upon IBM's stock price (e.g. increase in market valuation) selling a machine under its actual cost, had upon the company? I *can* point you to their bottom line and show you where that decreased by exactly the amount they may have lost in selling this machine (IBM is smart, they generally don't do business when they will lose money, they try to at least break even). You can *always* see the net impact of these sorts of "prestige" sales. Revenue increases, and profits stay flat. Like it or not, wall street punishes you when this happens. This means your gross and net margins drop. So if the stock price rose more than the net margins dropped ... then you *might* be able to ascribe value to that. The "I'm sure that..." doesn't fly here. Any argument that starts like that isn't going to win you friends in the financial community. The ones who do ascribe value to IBM's stock, and the price in the increased business risks associated with lower margins. > 500 right now. I'm sure their success with Roadrunner, Deep Blue, and > Blue Gene have something to do with that. Again, see above. This is not likely to be the case. > > If not direct technology transfer, I bet Bob at Acme thinks to himself > "IBM has done a lot of great things in supercomputing. They're > definitely the experts. I think we should hire them to build and > integrate a new 128-node cluster for our comp. chem group. We don't see that. Anyone on this list care to comment? This is a good question for the list: Did you buy an IBM because of Roadrunner? Did you buy a Cray because of Jaguar? Or are your purchase decisions largely a function of budget, suitability to purpose, technological considerations, and ... how big of a discount you got? I suspect the latter. I don't think many folks were influenced to any significant degree by the heroic class systems, other than to say "cool". -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From Daniel.Pfenniger at unige.ch Wed Jun 30 21:06:12 2010 From: Daniel.Pfenniger at unige.ch (Pfenniger Daniel) Date: Thu, 01 Jul 2010 06:06:12 +0200 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C2C003D.8000606@scalableinformatics.com> References: <4C2B3C5D.9030405@ias.edu> <4C2B510E.5050406@scalableinformatics.com> <4C2B64BA.3000704@ias.edu> <4C2B6C69.60309@scalableinformatics.com> <4C2B9E5F.2010602@ias.edu> <4C2BF56E.7090303@gmail.com> <4C2C003D.8000606@scalableinformatics.com> Message-ID: <4C2C1434.9030307@unige.ch> Joe Landman wrote: > Richard Chang wrote: > .... > Does prestige translate into increased revenue? Lets ask on this list > (self selecting, probably not statistically valid, but may give a rough > picture): In non-US wealthy countries the Top 500 list is a powerful argument to get HPC hardware from governmental funding agencies. The country is too low on the list according to the national ego? Then for sure there will be some additional money for correcting the disgrace. And then the hardware will be purchased at list price. > ... 
Dan From james.p.lux at jpl.nasa.gov Wed Jun 30 21:24:48 2010 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Wed, 30 Jun 2010 21:24:48 -0700 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C2C076A.5010703@scalableinformatics.com> Message-ID: On 6/30/10 8:11 PM, "Joe Landman" wrote: > > > Or are your purchase decisions largely a function of budget, suitability > to purpose, technological considerations, and ... how big of a discount > you got? > > I suspect the latter. I don't think many folks were influenced to any > significant degree by the heroic class systems, other than to say "cool". > I can posit that such heroic systems might convince your upper management to allow you to buy *any* cluster, especially if it's from the "itty bitty monopoly" But as Joe points out, when it actually comes to buying, the gimlet eyes of the green eyeshade brigade will be cast over the bids. It's all about cost (whether capital or life cycle). Nobody is going to pay more just because the vendor did a stunt of one sort or another. From forum.san at gmail.com Wed Jun 30 21:51:09 2010 From: forum.san at gmail.com (Sangamesh B) Date: Thu, 1 Jul 2010 10:21:09 +0530 Subject: [Beowulf] Multiple FlexLM lmgrd services on a single Linux machine? Message-ID: Dear All, We're in a process of implementing a centralized FlexLM license server for multiple commercial applications. Can some one tell us, whether Linux OS support multiple lmgrd services or not? If its not directly, is there a way to do it? For example, can we install FlexLM license servers of both ANSYS and STAR CD on a single linux server? Thank you -------------- next part -------------- An HTML attachment was scrubbed... URL: From hahn at mcmaster.ca Wed Jun 30 22:21:47 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 1 Jul 2010 01:21:47 -0400 (EDT) Subject: [Beowulf] Multiple FlexLM lmgrd services on a single Linux machine? In-Reply-To: References: Message-ID: > Linux OS support multiple lmgrd services or not? If its not directly, is > there a way to do it? I don't really understand what you're asking. yes, linux provides fully functional TCP/IP. yes, flexlm can run either with a merged license file (single base port, multiple vendor ports), or with multiple completely separate instances (listening on say, ports 27000+27001 and 28000+28001). the latter is often more convenient, since it means you can adjust one instance without affecting the other. From hahn at mcmaster.ca Wed Jun 30 23:20:41 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 1 Jul 2010 02:20:41 -0400 (EDT) Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: References: Message-ID: > (whether capital or life cycle). Nobody is going to pay more just because > the vendor did a stunt of one sort or another. I agree: vendor stunts are advertising. but think of them as like a mating display - elaborate feathers or a big rack of antlers. they make a claim of fitness that speaks mainly to a customer's risk-aversion: if IBM/Cray/etc can make some giant cluster work, then surely our little cluster project will succeed. if you have more in-house expertise, you may not value this as much. in a sense, this factor is anti-beowulf, since the expectation for really commoditized parts is that they'll Just Work. with some modest care, you can be pretty confident that the software stack will Just Work. especially with open-source, which provides greater access and fixability. 
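A minimal sketch of the second arrangement Mark describes for FlexLM (two completely separate lmgrd instances on one box); the paths, hostid, ports, and vendor daemon names below are illustrative guesses only, and the real values come from the license files each vendor issues:

  # Two independent lmgrd instances, each with its own license file,
  # port pair, and log file (all names and paths here are placeholders).
  #
  # /opt/flexlm/ansys/license.dat   (excerpt)
  #   SERVER licserver 001122334455 27000
  #   VENDOR ansyslmd port=27001
  #
  # /opt/flexlm/starcd/license.dat  (excerpt)
  #   SERVER licserver 001122334455 28000
  #   VENDOR cdlmd port=28001

  /opt/flexlm/bin/lmgrd -c /opt/flexlm/ansys/license.dat  -l /var/log/lmgrd-ansys.log
  /opt/flexlm/bin/lmgrd -c /opt/flexlm/starcd/license.dat -l /var/log/lmgrd-starcd.log

  # query each instance independently
  /opt/flexlm/bin/lmutil lmstat -a -c 27000@licserver
  /opt/flexlm/bin/lmutil lmstat -a -c 28000@licserver

Clients then point at 27000@licserver or 28000@licserver via LM_LICENSE_FILE or the vendor-specific variable, and pinning both the lmgrd and vendor-daemon ports in the SERVER/VENDOR lines keeps firewall rules simple while, as Mark notes, letting you restart one vendor's service without touching the other.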
so most of the value of brand boils down to hardware/firmware-level issues that customers are not well-equipped to deal with those, either at bid-eval time or once the deal is done. my perception, though, is that vendors try to pretend such problems don't happen, rather than bragging about how well they solve them... regards, mark hahn. From bill at cse.ucdavis.edu Wed Jun 30 23:24:00 2010 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Wed, 30 Jun 2010 23:24:00 -0700 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <20100630055026.GF28068@bx9.net> References: <20100630055026.GF28068@bx9.net> Message-ID: <4C2C3480.2000206@cse.ucdavis.edu> On 06/29/2010 10:50 PM, Greg Lindahl wrote: > On Wed, Jun 30, 2010 at 12:30:12AM -0500, Rahul Nabar wrote: > >> The Top500 list has many useful metrics but I didn't see any $$ based >> metrics there. > > Other communities with $$-based metrics haven't had much success with > them. > > In HPC, many contracts are multi-year, multi-delivery, or, they > include significant extra stuff beyond the iron. Just have the vendor provide a list of included equipment, and a price with the stipulation that anyone that wants that list of equipment gets it for exactly that price. Maybe include some low level of service like equipment replacement via return to depot for 3 years. So vendors would work out nice discounts for their favorite customers, and the customer could brag to their bosses about how much under the retail price they got. From dmitri.chubarov at gmail.com Wed Jun 30 01:23:24 2010 From: dmitri.chubarov at gmail.com (Dmitri Chubarov) Date: Wed, 30 Jun 2010 15:23:24 +0700 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <20100630055026.GF28068@bx9.net> References: <20100630055026.GF28068@bx9.net> Message-ID: Hello, This metric however misleading it might be, is sometimes used in press releases. Here is just one that I could find by googling (Google translation of a Russian original). http://translate.google.com/translate?js=y&prev=_t&hl=en&ie=UTF-8&layout=1&eotf=1&u=http%3A%2F%2Fwww.pcmag.ru%2Fnews%2Fdetail.php%3FID%3D6234&sl=ru&tl=en It says: "unique for the Russian market price / performance ratio of the complex: in the light of the full cost of the solution of 158K USD for 1 teraflop of peak performance." (The text is dated 27.02.2007) The price per teraflops is determined by the components cost, mainly the CPU and the memory, so looking at the Intel, AMD, whatever price lists per socket and subsequent extrapolation is all one needs to get this metric. Though if you would factor the infrastructure costs in, the picture gets much more complicated. --dc On Wed, Jun 30, 2010 at 12:30:12AM -0500, Rahul Nabar wrote: > The Top500 list has many useful metrics but I didn't see any $$ based > metrics there. > From bibil.thaysose at gmail.com Wed Jun 30 08:50:27 2010 From: bibil.thaysose at gmail.com (Greg Rubino) Date: Wed, 30 Jun 2010 11:50:27 -0400 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C2B64BA.3000704@ias.edu> References: <4C2B3C5D.9030405@ias.edu> <4C2B510E.5050406@scalableinformatics.com> <4C2B64BA.3000704@ias.edu> Message-ID: I have to say I partially agree with Prentice. I don't know if prestige directly translates into revenue, but if your a huge company and your platform is the first one upon which some new innovation in HPC is implemented (cutthroat or not), you have a huge opportunity on your hands. 
I guess it depends upon the terms under which you took that initial "loss" (s/loss/risk/g). >> >> Prestige does not translate into (profitable) revenue. >> > > I think you meant to say "Prestige does not always translate into > profitable revenue. > From akshar.bhosale at gmail.com Wed Jun 30 12:32:09 2010 From: akshar.bhosale at gmail.com (akshar bhosale) Date: Thu, 1 Jul 2010 01:02:09 +0530 Subject: [Beowulf] guide for pbs/torque and mpi Message-ID: hi, we want to have a good reference guide for torque(pbs),maui and mpi akshar -------------- next part -------------- An HTML attachment was scrubbed... URL:
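On the Torque/Maui/MPI guide question: the Torque and Maui administrator guides from their upstream maintainers, plus your MPI distribution's documentation, cover most of it. As a taste of what such a guide walks through, a minimal MPI job script might look like the sketch below; the queue name, resource request, and application binary are placeholders, and the exact mpirun syntax depends on which MPI stack and launcher you build:

  #!/bin/bash
  #PBS -N mpi_test
  #PBS -q batch                  # queue name is site-specific (placeholder)
  #PBS -l nodes=4:ppn=8          # 4 nodes, 8 cores per node
  #PBS -l walltime=01:00:00
  #PBS -j oe                     # merge stdout and stderr

  cd "$PBS_O_WORKDIR"            # Torque starts jobs in $HOME by default

  NP=$(wc -l < "$PBS_NODEFILE")  # total cores Torque allocated to the job

  # With an MPI built against Torque's tm interface, mpirun can discover the
  # node list itself; the explicit machinefile form is the conservative way.
  mpirun -np "$NP" -machinefile "$PBS_NODEFILE" ./my_mpi_app

Submit it with qsub, watch it with qstat, and let Maui (running alongside pbs_server) handle prioritization and backfill; the same script structure carries over between sites with only the #PBS directives changing.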