From gmpc at sanger.ac.uk Tue Jun 1 05:03:15 2010 From: gmpc at sanger.ac.uk (Guy Coates) Date: Tue, 01 Jun 2010 13:03:15 +0100 Subject: [Beowulf] cluster scheduler for dynamic tree-structured jobs? In-Reply-To: <20100515102454.GA99295@piskorski.com> References: <20100515102454.GA99295@piskorski.com> Message-ID: <4C04F703.9010700@sanger.ac.uk> On 15/05/10 11:24, Andrew Piskorski wrote: > Folks, I could use some advice on which cluster job scheduler (batch > queuing system) would be most appropriate for my particular needs. > I've looked through docs for SGE, Slurm, etc., but without first-hand > experience with each one it's not at all clear to me which I should > choose... > This may be late in the day but... If you job dependencies are too complicated for you queuing system to deal with, you may want to look at the Ensembl Hive system; http://www.ensembl.org/info/docs/eHive/index.html It is the system we use in-house for our genome-analysis pipelines, which have lots of complicated dependencies. It sits on top of a traditional queuing system which handles job-dispatch etc. It has been de-coupled from the genome analysis workflow, so you should (in theory) be able to use it for any analysis. Cheers, Guy -- Dr. Guy Coates, Informatics System Group The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK Tel: +44 (0)1223 834244 x 6925 Fax: +44 (0)1223 496802 -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From rreis at aero.ist.utl.pt Tue Jun 1 23:44:41 2010 From: rreis at aero.ist.utl.pt (Ricardo Reis) Date: Wed, 2 Jun 2010 07:44:41 +0100 (WEST) Subject: [Beowulf] Top 500 in the BBC Message-ID: http://infosthetics.com/archives/2010/05/bbc_news_visualizing_the_top_500_supercomputer_report.html best regards, Ricardo Reis 'Non Serviam' PhD candidate @ Lasef Computational Fluid Dynamics, High Performance Computing, Turbulence http://www.lasef.ist.utl.pt Cultural Instigator @ R?dio Zero http://www.radiozero.pt Keep them Flying! Ajude a/help Aero F?nix! http://www.aeronauta.com/aero.fenix http://www.flickr.com/photos/rreis/ < sent with alpine 2.00 > From sabujp at gmail.com Thu Jun 3 20:01:23 2010 From: sabujp at gmail.com (Sabuj Pattanayek) Date: Thu, 3 Jun 2010 22:01:23 -0500 Subject: [Beowulf] recommendations for parallel IO In-Reply-To: References: Message-ID: > ?We need an open source solution, we are looking into PVFS and Gluster (but > from what we see, Gluster doesn't quit fit the bill? It's more a distributed > filesystem than a parallel filesystem... or are we taking the wrong turn on > our reasoning, somewhere about this?) gluster in stripe mode is parallel. It can also be distributed or distributed and parallel, mirrored, etc. From pal at di.fct.unl.pt Fri Jun 4 04:45:05 2010 From: pal at di.fct.unl.pt (Paulo Afonso Lopes) Date: Fri, 4 Jun 2010 12:45:05 +0100 (WEST) Subject: [Beowulf] recommendations for parallel IO In-Reply-To: References: Message-ID: <61056.193.136.122.17.1275651905.squirrel@webmail.fct.unl.pt> Oi, Ricardo. > > Hi all > > We have a small cluster but some users need to use MPI-IO. We have a > NFS3 > shared partition but you would need to mount it with special options who > would hurt performance. Yes... 
options include all the available ways to enforce "no client caching" and that is (usually) very bad for performance :-) There's also NFS4.1 but I can't speak about it other than the last time (> 6 months) I looked, it was VERY OS dependent (you had to run kernel 2.6.x.y.z); furthermore, I haven't looked at the MPI-IO support status on 4.1. > We are looking into a nice parallel file system to > deploy in this context. We got 4 boxes with a 500Gb disk in each, for the Are the 4 boxes just for the filesystem service, or are they "the small cluster" ? > moment, connected with Gb. We have another Gb connection dedicated to the > MPI traffic. > > We need an open source solution, we are looking into PVFS I am using it and I have very good experiences with PVFS: easy installation, support -- !excellent! -- and good performance; the only minuses are 1) CPU use when you used GbE "dumb" cards (those usually integrated in the mobo) and 2) some limitations on the POSIX interface (which should not hurt you, as you're going the MPI-IO way). If the 4 nodes are both the compute and I/O nodes, then (1) above will hurt your applications iff they overlap I/O and computation. > and Gluster > (but from what we see, Gluster doesn't quit fit the bill? It's more a > distributed filesystem than a parallel filesystem... or are we taking the > wrong turn on our reasoning, somewhere about this?) > Yes, you're right, me thinks :-) However, I have no experience You could also look at Lustre. Back in the days (of CFS, Inc.) where there were 2 versions, the free and the not-free, the free was a nightmare to install (been there), had quite a few bugs, and was always months behind the non-free, but I am told that when Sun picked it they changed that, and there is only one version (altough there are some mumbles about Oracle's doing this and that) Abra?o paulo -- Paulo Afonso Lopes | Tel: +351- 21 294 8536 Departamento de Inform?tica | 294 8300 ext.10702 Faculdade de Ci?ncias e Tecnologia | Fax: +351- 21 294 8541 Universidade Nova de Lisboa | e-mail: poral at fct.unl.pt 2829-516 Caparica, PORTUGAL From gus at ldeo.columbia.edu Fri Jun 4 09:43:35 2010 From: gus at ldeo.columbia.edu (Gus Correa) Date: Fri, 04 Jun 2010 12:43:35 -0400 Subject: [Beowulf] Top 500 in the BBC In-Reply-To: References: Message-ID: <4C092D37.9060303@ldeo.columbia.edu> Ola' Ricardo, That's really nice! Here's the BBC link also: http://news.bbc.co.uk/2/hi/10187248.stm Na~o ha' nada como o ra'dio para a difusa~o da informaca~o! Abrac,o Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- Ricardo Reis wrote: > > http://infosthetics.com/archives/2010/05/bbc_news_visualizing_the_top_500_supercomputer_report.html > > > best regards, > > Ricardo Reis > > 'Non Serviam' > > PhD candidate @ Lasef > Computational Fluid Dynamics, High Performance Computing, Turbulence > http://www.lasef.ist.utl.pt > > Cultural Instigator @ R?dio Zero > http://www.radiozero.pt > > Keep them Flying! Ajude a/help Aero F?nix! 
> > http://www.aeronauta.com/aero.fenix > > http://www.flickr.com/photos/rreis/ > > < sent with alpine 2.00 > > > > ------------------------------------------------------------------------ > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mathog at caltech.edu Tue Jun 8 10:44:55 2010 From: mathog at caltech.edu (David Mathog) Date: Tue, 08 Jun 2010 10:44:55 -0700 Subject: [Beowulf] OT: recoverable optical media archive format? Message-ID: This is off topic so I will try to keep it short: is there an "archival" format for large binary files which contains enough error correction to that all original data may be recovered even if there is a little data loss in the storage media? For my purposes these are disk images, sometimes .tar.gz, other times gunzip -c of dd dumps of whole partitions which have been "cleared" by filling the empty space with one big file full of zero, and then that file deleted. I'm thinking of putting this information on DVD's (only need to keep it for a few years at a time) but I don't trust that media not to lose a sector here or there - having watched far too many scratched DVD movies with playback problems. Unlike an SDLT with a bad section, the good parts of a DVD are still readable when there is a bad block (using dd or ddrescue) but of course even a single missing chunk makes it impossible to decompress a .gz file correctly. So what I'm looking for is some sort of .img.gz.ecc format, where the .ecc puts in enough redundant information to recover the underlying img.gz even when sectors or data are missing. If no such tool/format exists then two copies should be enough to recover all of an .img.gz so long as the same data wasn't lost on both media, and if bad DVD sectors always come back as "failed read", never ever showing up as a good read but actually containing bad data. Perhaps the frame checksum on a DVD is enough to guarantee that? Thanks, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From mdidomenico4 at gmail.com Tue Jun 8 11:05:19 2010 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Tue, 8 Jun 2010 14:05:19 -0400 Subject: [Beowulf] OT: recoverable optical media archive format? In-Reply-To: References: Message-ID: What's the ramification of losing a block? (ie file-system won't mount, data has a hole) Not that it's elegant, the first thing that pops to mind is using 'split' to chunk the file into many little bits and then md5 each bit On Tue, Jun 8, 2010 at 1:44 PM, David Mathog wrote: > This is off topic so I will try to keep it short: ?is there an > "archival" format for large binary files which contains enough error > correction to that all original data may be recovered even if there is a > little data loss in the storage media? > > For my purposes these are disk images, sometimes .tar.gz, other times > gunzip -c of dd dumps of whole partitions which have been "cleared" by > filling the empty space with one big file full of zero, and then that > file deleted. ?I'm thinking of putting this information on DVD's (only > need to keep it for a few years at a time) but I don't trust that media > not to lose a sector here or there - having watched far too many > scratched DVD movies with playback problems. 
> > Unlike an SDLT with a bad section, the good parts of a DVD are still > readable when there is a bad block (using dd or ddrescue) but of course > even a single missing chunk makes it impossible to decompress a .gz file > correctly. ?So what I'm looking for is some sort of .img.gz.ecc format, > where the .ecc puts in enough redundant information to recover the > underlying img.gz even when sectors or data are missing. ? If no such > tool/format exists then two copies should be enough to recover all of an > .img.gz so long as the same data wasn't lost on both media, and if bad > DVD sectors always come back as "failed read", never ever showing up as > a good read but actually containing bad data. ?Perhaps the frame > checksum on a DVD is enough to guarantee that? > > Thanks, > > David Mathog > mathog at caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From reuti at staff.uni-marburg.de Tue Jun 8 12:03:59 2010 From: reuti at staff.uni-marburg.de (Reuti) Date: Tue, 8 Jun 2010 21:03:59 +0200 Subject: [Beowulf] OT: recoverable optical media archive format? In-Reply-To: References: Message-ID: <230D2E83-0269-4512-B9F1-54D273327888@staff.uni-marburg.de> Hi, Am 08.06.2010 um 19:44 schrieb David Mathog: > This is off topic so I will try to keep it short: is there an > "archival" format for large binary files which contains enough error > correction to that all original data may be recovered even if there > is a > little data loss in the storage media? > > For my purposes these are disk images, sometimes .tar.gz, other times > gunzip -c of dd dumps of whole partitions which have been "cleared" by > filling the empty space with one big file full of zero, and then that > file deleted. I'm thinking of putting this information on DVD's (only > need to keep it for a few years at a time) but I don't trust that > media > not to lose a sector here or there - having watched far too many > scratched DVD movies with playback problems. > > Unlike an SDLT with a bad section, the good parts of a DVD are still > readable when there is a bad block (using dd or ddrescue) but of > course > even a single missing chunk makes it impossible to decompress a .gz > file > correctly. So what I'm looking for is some sort of .img.gz.ecc > format, > where the .ecc puts in enough redundant information to recover the > underlying img.gz even when sectors or data are missing. If no such > tool/format exists then two copies should be enough to recover all > of an > .img.gz so long as the same data wasn't lost on both media, and if bad > DVD sectors always come back as "failed read", never ever showing up > as > a good read but actually containing bad data. Perhaps the frame > checksum on a DVD is enough to guarantee that? besides splitting the file, I would suggest to generate some par/par2 files. This format was originally used on the Usene, to have a reliable way to transfer binary attachements. I.e. first you split your files into e.g. 10 pieces each and generate 5 par/par2 files for each of them. Then you need any 10 out of these 15 into total to be good to recover the original file. 
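For example, a minimal sketch of that workflow with the par2cmdline
tools might look like this (the file names, the piece size and the 30%
redundancy figure are only placeholders, not anything prescribed):

  split -b 100M backup.tar.gz backup.tar.gz.     # cut the archive into pieces
  par2 create -r30 backup.par2 backup.tar.gz.*   # ~30% recovery data
  # ... burn the pieces plus the .par2/.vol files; later, after copying
  # everything back off the disc:
  par2 verify backup.par2
  par2 repair backup.par2                        # rebuild damaged/missing pieces
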
http://en.wikipedia.org/wiki/Parchive -- Reuti > Thanks, > > David Mathog > mathog at caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From beckerjes at mail.nih.gov Tue Jun 8 17:49:57 2010 From: beckerjes at mail.nih.gov (Jesse Becker) Date: Tue, 8 Jun 2010 20:49:57 -0400 Subject: [Beowulf] OT: recoverable optical media archive format? In-Reply-To: References: Message-ID: <20100609004957.GM26589@mail.nih.gov> I came across this page a few years back that discusses this very problem: http://users.softlab.ntua.gr/~ttsiod/rsbep.html On Tue, Jun 08, 2010 at 01:44:55PM -0400, David Mathog wrote: >This is off topic so I will try to keep it short: is there an >"archival" format for large binary files which contains enough error >correction to that all original data may be recovered even if there is a >little data loss in the storage media? > >For my purposes these are disk images, sometimes .tar.gz, other times >gunzip -c of dd dumps of whole partitions which have been "cleared" by >filling the empty space with one big file full of zero, and then that >file deleted. I'm thinking of putting this information on DVD's (only >need to keep it for a few years at a time) but I don't trust that media >not to lose a sector here or there - having watched far too many >scratched DVD movies with playback problems. > >Unlike an SDLT with a bad section, the good parts of a DVD are still >readable when there is a bad block (using dd or ddrescue) but of course >even a single missing chunk makes it impossible to decompress a .gz file >correctly. So what I'm looking for is some sort of .img.gz.ecc format, >where the .ecc puts in enough redundant information to recover the >underlying img.gz even when sectors or data are missing. If no such >tool/format exists then two copies should be enough to recover all of an >.img.gz so long as the same data wasn't lost on both media, and if bad >DVD sectors always come back as "failed read", never ever showing up as >a good read but actually containing bad data. Perhaps the frame >checksum on a DVD is enough to guarantee that? > >Thanks, > >David Mathog >mathog at caltech.edu >Manager, Sequence Analysis Facility, Biology Division, Caltech >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Jesse Becker NHGRI Linux support (Digicon Contractor) From kilian.cavalotti.work at gmail.com Wed Jun 9 00:33:16 2010 From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI) Date: Wed, 9 Jun 2010 09:33:16 +0200 Subject: [Beowulf] OT: recoverable optical media archive format? In-Reply-To: References: Message-ID: On Tue, Jun 8, 2010 at 8:05 PM, Michael Di Domenico wrote: > Not that it's elegant, the first thing that pops to mind is using > 'split' to chunk the file into many little bits and then md5 each bit While this may let you know that a file has been corrupted, it won't help recovering that file. Some compression algorithms, which may be considered as storage algorithms if you turn compression off, have options to create recovery records. 
For instance, in the RAR format (http://en.wikipedia.org/wiki/RAR), you can choose how much redundant data you want to include in your archive (whose size will be increased accordingly). Excerpt from Alexander Roshal's rar user's manual: """ rr[N] Add data recovery record. Optionally, redundant information (recovery record) may be added to an archive. This will cause a small increase of the archive size and helps to recover archived files in case of floppy disk failure or data losses of any other kind. A recovery record contains up to 524288 recovery sectors. The number of sectors may be specified directly in the 'rr' command (N = 1, 2 .. 524288) or, if it is not specified by the user, it will be selected automatically according to the archive size: a size of the recovery information will be about 1% of the total archive size, usually allowing the recovery of up to 0.6% of the total archive size of continuously damaged data. It is also possible to specify the recovery record size in percent to the archive size. Just append the percent character to the command parameter. For example: rar rr3% arcname If data is damaged continuously, then each rr-sector helps to recover 512 bytes of damaged information. This value may be lower in cases of multiple damage. """ Cheers, -- Kilian From mathog at caltech.edu Thu Jun 10 12:20:39 2010 From: mathog at caltech.edu (David Mathog) Date: Thu, 10 Jun 2010 12:20:39 -0700 Subject: [Beowulf] OT: recoverable optical media archive format? Message-ID: Jesse Becker and others suggested: > http://users.softlab.ntua.gr/~ttsiod/rsbep.html I tried it and it works, mostly, but definitely has some warts. To start with I gave it a negative control - a file so badly corrupted it should NOT have been able to recover it. % ssh remotePC 'dd if=/dev/sda1 bs=8192' >img.orig % cat img.orig | bzip2 >img.bz2.orig % cat img.bz2.orig | rsbep > img.bz2.rsbep % cat img.bz2.rsbep | pockmark -maxgap 100000 -maxrun 10000 >img.bz2.rsbep.pox % cat img.bz2.rsbep.pox | rsbep -d -v >img.bz2.restored rsbep: number of corrected failures : 9725096 rsbep: number of uncorrectable blocks : 0 img.orig is a Windows XP partition with all empty space filled with 0x0 bytes. That is then compressed with bzip2, then run through rsbep (the one from the link above), then corrupted with pockmark. Pockmark is my own little concoction, when used as shown it stamps 0x0 bytes starting randomly every (1-MAXGAP) bytes, for a run of (1-MAXRUN). In both cases the gap and run length are chosen at random from those ranges for each new gap/run. This should corrupt around 10% of the file, which I assumed would render it unrecoverable. Notice in the file sizes below that the overall size did not change when the file was run through pockmark. rsbep did not note any errors it couldn't correct. However, the size of the restored file is not the same as the orig. 4056976560 2010-06-08 17:51 img.bz2.restored 4639143600 2010-06-08 16:19 img.bz2.rsbep.pox 4639143600 2010-06-08 16:13 img.bz2.rsbep 4056879025 2010-06-08 14:40 img.bz2.orig 20974431744 2010-06-07 15:23 img.orig % bunzip2 -tvv img.bz2.restored img.bz2.restored: [1: huff+mtf data integrity (CRC) error in data So at the very least rsbep sometimes says it has recovered a file when it has not. I didn't really expect it to rescue this particular input, but it really should have handled it better. 
I reran it with a less damaged file like this: % cat img.bz2.rsbep | pockmark -maxgap 1000000 -maxrun 10000 >img.bz2.rsbep.pox2 % cat img.bz2.rsbep.pox2 | rsbep -d -v >img.bz2.restored2 rsbep: number of corrected failures : 46025036 rsbep: number of uncorrectable blocks : 0 % bunzip2 img.bz2.restored2 bunzip2: Can't guess original name for img.bz2.restored2 -- using img.bz2.restored2.out bunzip2: img.bz2.restored2: trailing garbage after EOF ignored % md5sum img.bz2.restored2.out img.orig 7fbaec7143c3a17a31295a803641aa3c img.bz2.restored2.out 7fbaec7143c3a17a31295a803641aa3c img.orig This time it was able to recover the corrupted file, but again, it created an output file which was a different size. Is this always the case? Seems to be at least for the size file used here: % cat img.bz2.orig | rsbep | rsbep -d > nopox.bz2 nopox.bz2 is also 4056976560. The decoded output is always 97535 bytes larger than the original, which may bear some relation to the z=ERR_BURST_LEN parameter as: 97535 /765 = 127.496732 which is suspiciously close to 255/2. Or that could just be a coincidence. In any case, bunzip2 was able to handle the crud on the end, but this would have been a problem for other binary files. Tbe other thing that is frankly bizarre is the number of "corrected" failures for the 2nd case vs. the first. The 2nd should have 10X fewer bad bytes than the first, but the rsbep status messages indicate 4.73X MORE. However, the number of bad bytes in the 2nd is almost exactly 1%, as it should be. All of this suggests that rsbep does not handle correctly files which are "too" corrupted. It gives the wrong number of corrected blocks and thinks that it has corrected everything when it has not done so. Worse, even when it does work the output file was never (in any of the test cases) the same size as the input file. I think this program has potential but it needs a bit of work to sand the rough edges off. I will have a look at it, but won't have a chance to do so for a couple of weeks. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From atchley at myri.com Thu Jun 10 13:11:36 2010 From: atchley at myri.com (Scott Atchley) Date: Thu, 10 Jun 2010 16:11:36 -0400 Subject: [Beowulf] OT: recoverable optical media archive format? In-Reply-To: References: Message-ID: <5E6C3D74-439A-41EE-8CE6-FAF5CE8888C7@myri.com> On Jun 10, 2010, at 3:20 PM, David Mathog wrote: > Jesse Becker and others suggested: > >> http://users.softlab.ntua.gr/~ttsiod/rsbep.html > > I tried it and it works, mostly, but definitely has some warts. > > To start with I gave it a negative control - a file so badly corrupted > it should NOT have been able to recover it. > > % ssh remotePC 'dd if=/dev/sda1 bs=8192' >img.orig > % cat img.orig | bzip2 >img.bz2.orig > % cat img.bz2.orig | rsbep > img.bz2.rsbep > % cat img.bz2.rsbep | pockmark -maxgap 100000 -maxrun 10000 >> img.bz2.rsbep.pox > % cat img.bz2.rsbep.pox | rsbep -d -v >img.bz2.restored > rsbep: number of corrected failures : 9725096 > rsbep: number of uncorrectable blocks : 0 > > img.orig is a Windows XP partition with all empty space filled with > 0x0 bytes. That is then compressed with bzip2, then run > through rsbep (the one from the link above), then corrupted > with pockmark. Pockmark is my own little concoction, when used as > shown it stamps 0x0 bytes starting randomly every (1-MAXGAP) bytes, for > a run of (1-MAXRUN). In both cases the gap and run length are chosen at > random from those ranges for each new gap/run. 
> This should corrupt around 10% of the file, which I assumed would render > it unrecoverable. Notice in the file sizes below that the overall size > did not change when the file was run through pockmark. rsbep did not > note any errors it couldn't correct. However, the > size of the restored file is not the same as the orig. > > 4056976560 2010-06-08 17:51 img.bz2.restored > 4639143600 2010-06-08 16:19 img.bz2.rsbep.pox > 4639143600 2010-06-08 16:13 img.bz2.rsbep > 4056879025 2010-06-08 14:40 img.bz2.orig > 20974431744 2010-06-07 15:23 img.orig > > % bunzip2 -tvv img.bz2.restored > img.bz2.restored: > [1: huff+mtf data integrity (CRC) error in data > > So at the very least rsbep sometimes says it has recovered a file when > it has not. I didn't really expect it to rescue this particular input, > but it really should have handled it better. I have never used this tool, but I would wonder if your pockmark tool damaged the rsbep metadata, specifically one or more of the metadata segment lengths. Bear in mind that corruption of the metadata is not beyond the realm of possibility, but I assume that the rsbep metadata is not replicated or otherwise protected. > I reran it with a less damaged file like this: > > % cat img.bz2.rsbep | pockmark -maxgap 1000000 -maxrun 10000 >> img.bz2.rsbep.pox2 > % cat img.bz2.rsbep.pox2 | rsbep -d -v >img.bz2.restored2 > rsbep: number of corrected failures : 46025036 > rsbep: number of uncorrectable blocks : 0 > % bunzip2 img.bz2.restored2 > bunzip2: Can't guess original name for img.bz2.restored2 -- using > img.bz2.restored2.out > bunzip2: img.bz2.restored2: trailing garbage after EOF ignored > % md5sum img.bz2.restored2.out img.orig > 7fbaec7143c3a17a31295a803641aa3c img.bz2.restored2.out > 7fbaec7143c3a17a31295a803641aa3c img.orig > > This time it was able to recover the corrupted file, but again, it > created an output file which was a different size. Is this always the > case? Seems to be at least for the size file used here: > > % cat img.bz2.orig | rsbep | rsbep -d > nopox.bz2 > > nopox.bz2 is also 4056976560. The decoded output is always 97535 bytes > larger than the original, which may bear some relation to the > z=ERR_BURST_LEN parameter as: > > 97535 /765 = 127.496732 > > which is suspiciously close to 255/2. Or that could just be a coincidence. > > In any case, bunzip2 was able to handle the crud on the end, but this > would have been a problem for other binary files. This is most likely a requirement of the underlying Reed-Solomon library that requires equal length blocksizes. If your original file is N bytes and N % M != 0 where M is the blocksize, I imagine it pads the last block with 0s so that it is N bytes. It should not affect bunzip since the length is encoded in the file and it ignores anything tacked onto the end. A quick glance at his website, it claims that the length should be the same. He only shows, however, the md5sums and not the ls -l output. Scott > Tbe other thing that is frankly bizarre is the number of "corrected" > failures for the 2nd case vs. the first. The 2nd should have 10X > fewer bad bytes than the first, but the rsbep status messages > indicate 4.73X MORE. However, the number of bad bytes in the 2nd is > almost exactly 1%, as it should be. All of this suggests that rsbep > does not handle correctly files which are "too" corrupted. It gives the > wrong number of corrected blocks and thinks that it has corrected > everything when it has not done so. 
Worse, even when it does work the > output file was never (in any of the test cases) the same size as the > input file. > > I think this program has potential but it needs a bit of work to sand > the rough edges off. I will have a look at it, but won't have a chance > to do so for a couple of weeks. > > Regards, > > David Mathog > mathog at caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mathog at caltech.edu Thu Jun 10 14:43:21 2010 From: mathog at caltech.edu (David Mathog) Date: Thu, 10 Jun 2010 14:43:21 -0700 Subject: [Beowulf] OT: recoverable optical media archive format? Message-ID: Scott Atchley wrote: > I have never used this tool, but I would wonder if your pockmark tool damaged the rsbep metadata, specifically one or more of the metadata segment lengths. Bear in mind that corruption of the metadata is not beyond the realm of possibility, but I assume that the rsbep metadata is not replicated or otherwise protected. pockmark just stomps on random parts of the file, so the metadata is as open to destruction as anything else. Presumably that shouldn't be an issue for this sort of program though - the metadata should also be protected in some manner. > > In any case, bunzip2 was able to handle the crud on the end, but this > > would have been a problem for other binary files. > > This is most likely a requirement of the underlying Reed-Solomon library that requires equal length blocksizes. If your original file is N bytes and N % M != 0 where M is the blocksize, I imagine it pads the last block with 0s so that it is N bytes. It should not affect bunzip since the length is encoded in the file and it ignores anything tacked onto the end. bunzip2 id not affected, but it is not a good thing to do in general. Not all binary files will be functionally equivalent after null bytes are added on the end! > > A quick glance at his website, it claims that the length should be the same. He only shows, however, the md5sums and not the ls -l output. I forwarded my observations to the program's author and suggested that if I ran the program incorrectly, or he finds these really are bugs, that he post back here with corrections. I tried rsbep again with a test file of size 81920000 bytes (much less than 32bits unsigned, the first test file was larger in bytes than 32 bits unsigned) but similar problems arose. One difference, for the smaller test file the restored files were 240842 bytes bigger, not 97535 like before. My guess is that since the program dates back to the age of very small media it may be using "int" or "long" in locations where "long long" is needed today. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From dnlombar at ichips.intel.com Thu Jun 10 15:11:56 2010 From: dnlombar at ichips.intel.com (David N. Lombard) Date: Thu, 10 Jun 2010 15:11:56 -0700 Subject: [Beowulf] OT: recoverable optical media archive format? In-Reply-To: References: Message-ID: <20100610221155.GA30172@nlxcldnl2.cl.intel.com> On Thu, Jun 10, 2010 at 12:20:39PM -0700, David Mathog wrote: > Jesse Becker and others suggested: > > > http://users.softlab.ntua.gr/~ttsiod/rsbep.html > > I tried it and it works, mostly, but definitely has some warts. 
> > To start with I gave it a negative control - a file so badly corrupted > it should NOT have been able to recover it. > > % ssh remotePC 'dd if=/dev/sda1 bs=8192' >img.orig > % cat img.orig | bzip2 >img.bz2.orig > % cat img.bz2.orig | rsbep > img.bz2.rsbep > % cat img.bz2.rsbep | pockmark -maxgap 100000 -maxrun 10000 > >img.bz2.rsbep.pox > % cat img.bz2.rsbep.pox | rsbep -d -v >img.bz2.restored > rsbep: number of corrected failures : 9725096 > rsbep: number of uncorrectable blocks : 0 > > img.orig is a Windows XP partition with all empty space filled with > 0x0 bytes. That is then compressed with bzip2, then run > through rsbep (the one from the link above), then corrupted > with pockmark. Pockmark is my own little concoction, when used as > shown it stamps 0x0 bytes starting randomly every (1-MAXGAP) bytes, for > a run of (1-MAXRUN). In both cases the gap and run length are chosen at > random from those ranges for each new gap/run. The website is more interested in corrupted block media, with the assumption said corruption manifests as a cluster of invalid blocks from the file. You've got a different type of corruption. > % cat img.bz2.rsbep | pockmark -maxgap 1000000 -maxrun 10000 > >img.bz2.rsbep.pox2 > % cat img.bz2.rsbep.pox2 | rsbep -d -v >img.bz2.restored2 > rsbep: number of corrected failures : 46025036 > rsbep: number of uncorrectable blocks : 0 > % bunzip2 img.bz2.restored2 > bunzip2: Can't guess original name for img.bz2.restored2 -- using > img.bz2.restored2.out > bunzip2: img.bz2.restored2: trailing garbage after EOF ignored > % md5sum img.bz2.restored2.out img.orig > 7fbaec7143c3a17a31295a803641aa3c img.bz2.restored2.out > 7fbaec7143c3a17a31295a803641aa3c img.orig He documents a "freeze.sh" and "melt.sh" (in the contrib dir) that wrap rsbep and rsbep_chopper. That's very different from what you did. [dnl at closter ~]$ ls -l junk.avi -rw-rw-r-- 1 dnl dnl 12622344 2010-02-28 19:17 junk.avi [dnl at closter ~]$ freeze junk.avi > freeze1 [dnl at closter ~]$ melt freeze1 > melt1 [dnl at closter ~]$ md5sum freeze1 junk.avi melt1 4f8052c358e5bd86b9bfffd980726940 junk.avi dcbeafa75ec60f50d003876866009213 freeze1 4f8052c358e5bd86b9bfffd980726940 melt1 [dnl at closter ~]$ ls -l junk.avi freeze1 melt1 -rw-rw-r-- 1 dnl dnl 14565600 2010-06-10 14:56 freeze1 -rw-rw-r-- 1 dnl dnl 12622344 2010-02-28 19:17 junk.avi -rw-rw-r-- 1 dnl dnl 12622344 2010-06-10 14:56 melt1 [dnl at closter ~]$ The above only shows a single example of it not damaging an intact file. But I only played with it for about 10m last night and the above is a laugh test. I'll start proper testing tonight... -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. From james.p.lux at jpl.nasa.gov Thu Jun 10 16:39:00 2010 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Thu, 10 Jun 2010 16:39:00 -0700 Subject: [Beowulf] OT: recoverable optical media archive format? In-Reply-To: <20100610221155.GA30172@nlxcldnl2.cl.intel.com> References: <20100610221155.GA30172@nlxcldnl2.cl.intel.com> Message-ID: > -----Original Message----- > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of David N. Lombard > Sent: Thursday, June 10, 2010 3:12 PM > To: David Mathog > Cc: beowulf at beowulf.org > Subject: Re: [Beowulf] OT: recoverable optical media archive format? > > > The website is more interested in corrupted block media, with the assumption > said corruption manifests as a cluster of invalid blocks from the file. 
You've > got a different type of corruption. > In the communications field one uses interleaving as a way to turn burst errors into isolated errors. One then uses a short length coding scheme to detect and correct the isolated errors. On systems where the errors are sporadic (communication links with additive white Gaussian noise, or RAM that gets sporadic single bit flips), then interleaving doesn't buy you anything, and you can use short codes like Hamming or Reed-Solomon/BCH. On systems where errors are transient (read errors from flash.. read the same location again a second time and it's ok) or where you can ask for a repeat, then block oriented "go back N" codes work well. Unless you have a timing determinism requirement or the retry interval is very long (8 hour light time to Pluto), and then, some sort of redundant block scheme (send every block three times in a row) gets used. A good practical example is the coding used on CDs... the error correcting code is a Reed Solomon, which does real well at isolated errors, but not great at burst errors. So they use a R-S code with a block interleave scheme in front of it and then another R-S code, because the error statistics from CD drives show burst errors. So it looks like you are looking for the inverse.. something that turns distributed errors into clumps? From Tina.Friedrich at diamond.ac.uk Fri Jun 4 01:39:38 2010 From: Tina.Friedrich at diamond.ac.uk (Tina Friedrich) Date: Fri, 04 Jun 2010 09:39:38 +0100 Subject: [Beowulf] Re: Bugfix for Broadcom NICs losing connectivity In-Reply-To: <20100525194056.GB16022@kaizen.mayo.edu> References: <201005251900.o4PJ0ElP016422@bluewest.scyld.com> <20100525194056.GB16022@kaizen.mayo.edu> Message-ID: <4C08BBCA.5070508@diamond.ac.uk> We've had that happen on some of our servers. Currently using the disable_msi workaround, which seems to have stopped it. I believe there's supposed to be a fix in the latest Red Hat kernel but we haven't really tested that yet. You loose all network connectivity (including IPMI) to the server - not all connectivity, so e.g. serial console (not SOL, proper serial console, or using a console server) still works (as would a locally attached keyboard/monitor). Unless you require network to log in :) . If one runs into this, it's a really weird one (before you find the bug report) - to all appearances, the server works happily, no strangeness in the logs - just network gone completely. It's not one to trigger easily - hard to track down sort of thing. Had 610s and 710s for a while before this first happened (and loads we never saw it on, still). We first saw it on a rather heavily used NFS server (i.e. lots of network I/O). Tina Cris Rhea wrote: >> In case it helps anyone using Dell R410 / 610 / 710 etc. servers: I have had >> machines lose their eth connections periodically (CentOS 5.4 bnx2 driver). >> Seems like a bug with the Broadcom NIC drivers. [luckily read of it on a >> Dell mailing list] >> >> Bug Reports: >> >> http://kbase.redhat.com/faq/docs/DOC-26837 >> http://patchwork.ozlabs.org/patch/51106 >> >> Not sure yet if this is exactly my issue but I'm giving it a shot now. >> Thought I'd post since, anecdotally I've seen many people use these servers >> on the list. >> >> -- >> Rahul > > I've been following this on the Dell list as I have approx. 50 R410s > in our cluster. > > One thing that isn't clear-- When this happens, do you lose all > connectivity to the node (i.e., do you have to reboot the node to > re-establish eth0)? 
> > My R410s are running CentOS 5.2 - 5.4 and I rarely have one go > down. > > --- Cris > > -- Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd Diamond House, Harwell Science and Innovation Campus - 01235 77 8442 From mb at gup.jku.at Tue Jun 8 11:31:17 2010 From: mb at gup.jku.at (Markus Baumgartner) Date: Tue, 08 Jun 2010 20:31:17 +0200 Subject: [Beowulf] OT: recoverable optical media archive format? In-Reply-To: References: Message-ID: <4C0E8C75.8020609@gup.jku.at> David Mathog wrote: > This is off topic so I will try to keep it short: is there an > "archival" format for large binary files which contains enough error > correction to that all original data may be recovered even if there is a > little data loss in the storage media? > > You could check out http://dvdisaster.net/en/index.html From rtomek at ceti.com.pl Tue Jun 8 12:19:44 2010 From: rtomek at ceti.com.pl (Tomasz Rola) Date: Tue, 8 Jun 2010 21:19:44 +0200 (CEST) Subject: [Beowulf] OT: recoverable optical media archive format? In-Reply-To: References: Message-ID: On Tue, 8 Jun 2010, David Mathog wrote: > This is off topic so I will try to keep it short: is there an > "archival" format for large binary files which contains enough error > correction to that all original data may be recovered even if there is a > little data loss in the storage media? > > For my purposes these are disk images, sometimes .tar.gz, other times > gunzip -c of dd dumps of whole partitions which have been "cleared" by > filling the empty space with one big file full of zero, and then that > file deleted. I'm thinking of putting this information on DVD's (only > need to keep it for a few years at a time) but I don't trust that media > not to lose a sector here or there - having watched far too many > scratched DVD movies with playback problems. > > Unlike an SDLT with a bad section, the good parts of a DVD are still > readable when there is a bad block (using dd or ddrescue) but of course > even a single missing chunk makes it impossible to decompress a .gz file > correctly. So what I'm looking for is some sort of .img.gz.ecc format, > where the .ecc puts in enough redundant information to recover the > underlying img.gz even when sectors or data are missing. If no such > tool/format exists then two copies should be enough to recover all of an > .img.gz so long as the same data wasn't lost on both media, and if bad > DVD sectors always come back as "failed read", never ever showing up as > a good read but actually containing bad data. Perhaps the frame > checksum on a DVD is enough to guarantee that? I use tar, gzip/bzip2, split - for creating a number of files of more or less similar lenghts (like, 50 megs or 100 megs, but usually 50). After that, I make par2 recovery files with par2cmdline tools (they make use of Solomon-Reed error correction) http://en.wikipedia.org/wiki/Parchive http://parchive.sourceforge.net/ I am unable to find par2cmdline via google ATM, but they should be somewhere. And last but not least, I burn it all (data + pars). HTH. Regards, Tomasz Rola -- ** A C programmer asked whether computer had Buddha's nature. ** ** As the answer, master did "rm -rif" on the programmer's home ** ** directory. And then the C programmer became enlightened... ** ** ** ** Tomasz Rola mailto:tomasz_rola at bigfoot.com ** From dkimpe at mcs.anl.gov Wed Jun 9 04:05:13 2010 From: dkimpe at mcs.anl.gov (Dries Kimpe) Date: Wed, 9 Jun 2010 11:05:13 +0000 Subject: [Beowulf] OT: recoverable optical media archive format? 
In-Reply-To: References: Message-ID: <20100609110513.GA4359@X300.rhi.hi.is> * David Mathog [2010-06-08 10:44:55]: > This is off topic so I will try to keep it short: is there an > "archival" format for large binary files which contains enough error > correction to that all original data may be recovered even if there is a > little data loss in the storage media? > For my purposes these are disk images, sometimes .tar.gz, other times > gunzip -c of dd dumps of whole partitions which have been "cleared" by > filling the empty space with one big file full of zero, and then that > file deleted. I'm thinking of putting this information on DVD's (only > need to keep it for a few years at a time) but I don't trust that media > not to lose a sector here or there - having watched far too many > scratched DVD movies with playback problems. > Unlike an SDLT with a bad section, the good parts of a DVD are still > readable when there is a bad block (using dd or ddrescue) but of course > even a single missing chunk makes it impossible to decompress a .gz file > correctly. So what I'm looking for is some sort of .img.gz.ecc format, > where the .ecc puts in enough redundant information to recover the > underlying img.gz even when sectors or data are missing. If no such > tool/format exists then two copies should be enough to recover all of an > .img.gz so long as the same data wasn't lost on both media, and if bad > DVD sectors always come back as "failed read", never ever showing up as > a good read but actually containing bad data. Perhaps the frame > checksum on a DVD is enough to guarantee that? You should also consider protecting the metadata of the filesystem; I.e. what good does it do to have split files, correction data, ... if it cannot find the file any longer because the damaged sector was in the directory metadata, not in the actual file data? RAR has 'recovery record' support that is tunable (you can pick how much space you want to sacrifice to recovery). You could pack everything in a rar file (with recovery records turned on) and write the whole file directly to the dvd (i.e. using dd or growisofs -Z /dev/dvd=rarfile). The downside is that the filesize will not be preserved, you'd have to check if unrar can deal with this or if it requires the file size to be known. A quick test with a small rar archive seems to indicate that it does not care if extra data is added at the end of the file. Dries -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available URL: From stuart at cyberdelix.net Wed Jun 9 07:17:01 2010 From: stuart at cyberdelix.net (lsi) Date: Wed, 09 Jun 2010 15:17:01 +0100 Subject: [Beowulf] OT: recoverable optical media archive format? In-Reply-To: References: Message-ID: <4C0FA25D.949.37AA4C7A@stuart.cyberdelix.net> The ECC approach is nice. My current solution is to burn two copies of each archive DVD. If the media deteriorates and there are unreadable sectors, I use the second copy of the DVD to replace the dead sectors. This is done using software which has a Bad Sector Mapping function, and a Patch File function. I have used Media Doctor to do this job. I wrote it up here: http://www.cyberdelix.net/tech/recover_cd_dvd.htm This said, I want to dump DVD as an archive format (I find that only certain drives will read certain DVDs, total PITA), I'm considering using HDDs, possibly 2.5" in a RAID configuration, in a NAS which is only fired up when needed. 
I suspect nowadays, 2.5" drives are more space-efficient, although I haven't done the sums. Stu On 8 Jun 2010 at 10:44, David Mathog wrote: > This is off topic so I will try to keep it short: is there an > "archival" format for large binary files which contains enough error > correction to that all original data may be recovered even if there is a > little data loss in the storage media? --- Stuart Udall stuart at at cyberdelix.dot net - http://www.cyberdelix.net/ --- * Origin: lsi: revolution through evolution (192:168/0.2) From ttsiodras at gmail.com Fri Jun 11 10:16:22 2010 From: ttsiodras at gmail.com (Thanassis Tsiodras) Date: Fri, 11 Jun 2010 20:16:22 +0300 Subject: [Beowulf] Re: rsbep issues In-Reply-To: References: Message-ID: I don't know if adding the beowulf mailing list address to CC will work - if it doesn't, please forward this response on my behalf. Your question abou the output size: Due to the mechanics of reed-solomon, if we want to be able to recover from sector errors, our input split must be (a) split in large sections and (b) interleaved. You can read the relevant theory from my "Algorithm" section on my rsbep page (http://users.softlab.ece.ntua.gr/~ttsiod/rsbep.html) where I explain why we need to interleave the stream... So, rsbep, on its own, creates outputs of multiples of 1040400. This won't do, of course - we don't want to see garbage after our "removal of shield" (piped to our gzip / bzip2 / lzma / whatever) ... So what does my package do? It installs two helper scripts (freeze.sh / melt.sh) which add a simple header on the stream, BEFORE shielding it with Reed-Solomon, so that the decoding side uses this size to "chop" off the extra cruft at the end... So if you use the tools as described in the packaged README, and as described on the site (that is, via freeze.sh/melt.sh) then you won't see the "garbage at the end" problem. The other thing you report, about the "silent corruption", is serious. If you check my configure.ac, you'll see that it checks for GCC version: # Check for bad GCC version (4.4.x create bad code for rs32.c) AX_GCC_VERSION if test `expr substr $GCC_VERSION 1 3` == "4.4" ; then AC_MSG_ERROR([GCC Series 4.4.x generate bad code... Please use 4.3.x instead]) fi Unfortunately, it's not just 4.4 - something has changed after GCC 4.3 that breaks the rsbep code... You can either use the 4.3.x series, or compile in plain-C mode (hardcode X86ASM to "no"). I use Debian, where the stable GCC version is 4.3.4, and therefore I don't see this bug... I updated autoconf check to stop the build if it detects GCC >= 4.4. That's all I can do for now... if you have the time/resources to figure out why the code broke after this GCC version, I'd happily publish your patches... (I use Debian, where the stable GCC is 4.3.4, and this problem doesn't manifest): bash$ dd if=/dev/urandom of=data bs=1M count=100 100+0 records in 100+0 records out 104857600 bytes (105 MB) copied, 20.0936 s, 5.2 MB/s bash$ freeze.sh data > data.shielded bash$ dd if=/dev/zero of=data.shielded conv=notrunc bs=512 count=127 bash$ melt.sh data.shielded | md5sum rsbep: number of corrected failures : 64784 rsbep: number of uncorrectable blocks : 0 b55c7886465da082c4949698858d342c - bash$ cat data | md5sum b55c7886465da082c4949698858d342c - So the data were recovered fine, even after a loss of 127 consecutive sectors. I hope this helps... Kind regards, Thanassis Tsiodras, Dr.-Ing. 
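(For illustration only: a rough sketch of that size-header idea in plain
shell. This is not the actual freeze.sh/melt.sh code, and the fixed-width
header format shown here is made up.)

  # "freeze": record the original length in a small fixed-width header,
  # then shield the whole stream with rsbep
  size=$(stat -c%s data)
  ( printf '%020d' "$size"; cat data ) | rsbep > data.shielded

  # "melt": undo the shielding, read the 20-byte header back, and chop
  # off the Reed-Solomon padding at the end
  rsbep -d < data.shielded > data.padded
  size=$(head -c 20 data.padded)
  tail -c +21 data.padded | head -c "$size" > data.recovered
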
On Thu, Jun 10, 2010 at 10:29 PM, David Mathog wrote: > Hi, > > I posted in the beowulf mailing list (beowulf at beowulf.org) asking for > suggestions for a program that would ?allow recovery from corrupted > media, and your rsbep variant was suggested. ?So I gave it a try, and it > had some issues with the test data I ran. ?Possibly because the files > were so large? > > It was built on a Mandriva 2008.1 system with: > > ./configure --prefix=/usr/common > make > make install > > I posted my observations to the beowulf mailing list, but am including a > copy of them below my signature. ?Perhaps you might want to respond to > that list with corrections to whatever I did wrong, or a notice of a > patched version of the program, if these really are bugs. > > I will send you a copy of the pockmark program in a separate email. > > Thanks, > > David Mathog > mathog at caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > > ------------- Forwarded message follows ------------- > > Jesse Becker and others suggested: > >> ? ? http://users.softlab.ntua.gr/~ttsiod/rsbep.html > > I tried it and it works, mostly, but definitely has some warts. > > To start with I gave it a negative control - a file so badly corrupted > it should NOT have been able to recover it. > > % ssh remotePC 'dd if=/dev/sda1 bs=8192' >img.orig > % cat img.orig ? ? ?| bzip2 >img.bz2.orig > % cat img.bz2.orig ?| rsbep > img.bz2.rsbep > % cat img.bz2.rsbep | pockmark -maxgap 100000 -maxrun 10000 >>img.bz2.rsbep.pox > % cat img.bz2.rsbep.pox | rsbep -d -v >img.bz2.restored > rsbep: number of corrected failures ? : 9725096 > rsbep: number of uncorrectable blocks : 0 > > img.orig is a Windows XP partition with all empty space filled with > 0x0 bytes. ?That is then compressed with bzip2, then run > through rsbep (the one from the link above), then corrupted > with pockmark. ?Pockmark is my own little concoction, when used as > shown ?it stamps 0x0 bytes starting randomly every (1-MAXGAP) bytes, for > a run of (1-MAXRUN). ?In both cases the gap and run length are chosen at > random from those ranges for each new gap/run. > This should corrupt around 10% of the file, which I assumed would render > it unrecoverable. ?Notice in the file sizes below that the overall size > did not change when the file was run through pockmark. ?rsbep did not > note any errors it couldn't correct. However, the > size of the restored file is not the same as the orig. > > ?4056976560 2010-06-08 17:51 img.bz2.restored > ?4639143600 2010-06-08 16:19 img.bz2.rsbep.pox > ?4639143600 2010-06-08 16:13 img.bz2.rsbep > ?4056879025 2010-06-08 14:40 img.bz2.orig > 20974431744 2010-06-07 15:23 img.orig > > % bunzip2 -tvv img.bz2.restored > ?img.bz2.restored: > ? ?[1: huff+mtf data integrity (CRC) error in data > > So at the very least rsbep sometimes says it has recovered a file when > it has not. ?I didn't really expect it to rescue this particular input, > but it really should have handled it better. ? I reran it with a less > damaged file like this: > > > % cat img.bz2.rsbep | pockmark -maxgap 1000000 -maxrun 10000 >>img.bz2.rsbep.pox2 > % cat img.bz2.rsbep.pox2 | rsbep -d -v >img.bz2.restored2 > rsbep: number of corrected failures ? 
: 46025036 > rsbep: number of uncorrectable blocks : 0 > % bunzip2 img.bz2.restored2 > bunzip2: Can't guess original name for img.bz2.restored2 -- using > img.bz2.restored2.out > bunzip2: img.bz2.restored2: trailing garbage after EOF ignored > % md5sum img.bz2.restored2.out img.orig > 7fbaec7143c3a17a31295a803641aa3c ?img.bz2.restored2.out > 7fbaec7143c3a17a31295a803641aa3c ?img.orig > > This time it was able to recover the corrupted file, but again, it > created an output file which was a different size. ?Is this always the > case? ? Seems to be at least for the size file used here: > > % cat img.bz2.orig | rsbep | rsbep -d > nopox.bz2 > > nopox.bz2 is also 4056976560. ? The decoded output is always 97535 bytes > larger than the original, which may bear some relation to the > z=ERR_BURST_LEN parameter as: > > ?97535 /765 = 127.496732 > > which is suspiciously close to 255/2. ?Or that could just be a coincidence. > > In any case, bunzip2 was able to handle the crud on the end, but this > would have been a problem for other binary files. > > Tbe other thing that is frankly bizarre is the number of "corrected" > failures for the 2nd case vs. the first. ? ?The 2nd should have 10X > fewer bad bytes than the first, but the rsbep status messages > indicate 4.73X MORE. ?However, the number of bad bytes in the 2nd is > almost exactly 1%, as it should be. ?All of this suggests that rsbep > does not handle correctly files which are "too" corrupted. ?It gives the > wrong number of corrected blocks and thinks that it has corrected > everything when it has not done so. ?Worse, even when it does work the > output file was never (in any of the test cases) the same size as the > input file. > > I think this program has potential but it needs a bit of work to sand > the rough edges off. ?I will have a look at it, but won't have a chance > to do so for a couple of weeks. > > Regards, > > David Mathog > mathog at caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > > > > > -- What I gave, I have; what I spent, I had; what I kept, I lost. -Old Epitaph From eugen at leitl.org Mon Jun 14 06:22:09 2010 From: eugen at leitl.org (Eugen Leitl) Date: Mon, 14 Jun 2010 15:22:09 +0200 Subject: [Beowulf] 10 U = 512x Atom Z530 + 2 GByte RAM Message-ID: <20100614132209.GG1964@leitl.org> http://www.heise.de/newsticker/meldung/Cloud-Server-mit-Intel-Atom-und-spaeter-auch-ARM-Prozessoren-1021400.html -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE From jlforrest at berkeley.edu Mon Jun 14 08:49:33 2010 From: jlforrest at berkeley.edu (Jon Forrest) Date: Mon, 14 Jun 2010 08:49:33 -0700 Subject: [Beowulf] 48-Core X86_64 Compute Node - Good Idea? Message-ID: <4C164F8D.60301@berkeley.edu> I have a cluster up and running that uses those SuperMicro Twin boxes, which have 2 nodes per rack unit, with each node using 2 AMD 6-core Istanbuls. This results in 12 cores per node, or 24 cores per rack unit. This is working fine. Now that the 12-core AMD processors are out, I was hoping that I could get the same configuration, except using 12-core processors, and yielding 48 cores per rack unit. The problem is, as of right now, I believe such boxes aren't available yet. The closest thing is a 4-way 1U box, which gives 48 cores per rack unit, but in *1 node*. 
My intuition tells me that I should be wary of such a configuration because of various SMP-related locking and concurrency issues. There probably aren't many single node 48 core boxes out there so there might be surprises. I don't like surprises. The obvious thing to do would be to wait until the Twin boxes come out but my problem is that I have money to spend that has to be spent soon, maybe before the Twin boxes come out. So, I'm trying to decide what to do. (I only want 1U boxes because I have to pay for rack space). Any advice? Cordially, -- Jon Forrest Research Computing Support College of Chemistry 173 Tan Hall University of California Berkeley Berkeley, CA 94720-1460 510-643-1032 jlforrest at berkeley.edu From ntmoore at gmail.com Mon Jun 14 09:22:34 2010 From: ntmoore at gmail.com (Nathan Moore) Date: Mon, 14 Jun 2010 11:22:34 -0500 Subject: [Beowulf] 10 U = 512x Atom Z530 + 2 GByte RAM In-Reply-To: References: <20100614132209.GG1964@leitl.org> Message-ID: I should have said "fancy ALU/FPU" no co-processor. Oh well. On Mon, Jun 14, 2010 at 11:21 AM, Nathan Moore wrote: > Sounds like a BlueGene without the fancy math co-processor and with a > more normal os... > > On Mon, Jun 14, 2010 at 8:22 AM, Eugen Leitl wrote: >> >> http://www.heise.de/newsticker/meldung/Cloud-Server-mit-Intel-Atom-und-spaeter-auch-ARM-Prozessoren-1021400.html >> >> >> >> -- >> Eugen* Leitl leitl http://leitl.org >> ______________________________________________________________ >> ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org >> 8B29F6BE: 099D 78BA 2FD3 B014 B08A ?7779 75B0 2443 8B29 F6BE >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf >> > > > > -- > - - - - - - - ? - - - - - - - ? - - - - - - - > Nathan Moore > Assistant Professor, Physics > Winona State University > AIM: nmoorewsu > - - - - - - - ? - - - - - - - ? - - - - - - - > -- - - - - - - - - - - - - - - - - - - - - - Nathan Moore Assistant Professor, Physics Winona State University AIM: nmoorewsu - - - - - - - - - - - - - - - - - - - - - From ntmoore at gmail.com Mon Jun 14 09:21:44 2010 From: ntmoore at gmail.com (Nathan Moore) Date: Mon, 14 Jun 2010 11:21:44 -0500 Subject: [Beowulf] 10 U = 512x Atom Z530 + 2 GByte RAM In-Reply-To: <20100614132209.GG1964@leitl.org> References: <20100614132209.GG1964@leitl.org> Message-ID: Sounds like a BlueGene without the fancy math co-processor and with a more normal os... 
On Mon, Jun 14, 2010 at 8:22 AM, Eugen Leitl wrote: > > http://www.heise.de/newsticker/meldung/Cloud-Server-mit-Intel-Atom-und-spaeter-auch-ARM-Prozessoren-1021400.html > > > > -- > Eugen* Leitl leitl http://leitl.org > ______________________________________________________________ > ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org > 8B29F6BE: 099D 78BA 2FD3 B014 B08A ?7779 75B0 2443 8B29 F6BE > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- - - - - - - - - - - - - - - - - - - - - - Nathan Moore Assistant Professor, Physics Winona State University AIM: nmoorewsu - - - - - - - - - - - - - - - - - - - - - From jlforrest at berkeley.edu Mon Jun 14 10:51:04 2010 From: jlforrest at berkeley.edu (Jon Forrest) Date: Mon, 14 Jun 2010 10:51:04 -0700 Subject: [Beowulf] 48-Core X86_64 Compute Node - Good Idea? In-Reply-To: References: <4C164F8D.60301@berkeley.edu> Message-ID: <4C166C08.8070600@berkeley.edu> On 6/14/2010 10:29 AM, Mark Hahn wrote: >> right now, I believe such boxes aren't available >> yet. The closest thing is a 4-way 1U box, which >> gives 48 cores per rack unit, but in *1 node*. > > well, the supermicro website lists them: > http://www.supermicro.com/Aplus/system/2U/2022/AS-2022TG-HTRF.cfm > http://www.supermicro.com/Aplus/system/2U/2022/AS-2022TG-HIBQRF.cfm Those are both 2U boxes. I was hoping to find a 1U box since they cost less to co-locate. >> My intuition tells me that I should be wary of >> such a configuration because of various SMP-related >> locking and concurrency issues. > > why? is there something peculiar about your workload, and especially > something that would show up with modestly higher SMPness? Nothing specific. I'm worried about latency and locking issues that only pop out when larger numbers of cores are used. > this is hardly uncharted territory. SGI's been there forever, > and some fringe boxes from Intel. but 8s 4c has been pretty mundane > for a while, and doesn't need any sort of hand-holding. unless you > mean something like "I expect to swap a lot and want to configure a > single non-raid swap partition", I don't really see what you're worrying > about... SGI isn't mainstream, and probably doesn't use the same chipset and motherboards that SuperMicro will be selling. > I think people should actually take fresh look at 4s 1U boxes > because AMD has eliminated the "4-socket penalty". there are some > nontrivial advantages to fatter nodes - they let you achieve some > unique workload configurations (bigger memory, higher-threaded, etc). > sysadmin work doesn't scale linearly as the number of nodes, of course, > but having fewer, fatter nodes can be attractive TCO-wise, too. If SuperMicro doesn't come up with Twin boxes, I might be forced to follow your advice. I'm not concerned about sysadmin work, because I'm using Rocks. I'm more concerned about ending up in the Twilight Zone where things aren't as they appear. -- Jon Forrest Research Computing Support College of Chemistry 173 Tan Hall University of California Berkeley Berkeley, CA 94720-1460 510-643-1032 jlforrest at berkeley.edu From hahn at mcmaster.ca Mon Jun 14 10:29:15 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon, 14 Jun 2010 13:29:15 -0400 (EDT) Subject: [Beowulf] 48-Core X86_64 Compute Node - Good Idea? 
In-Reply-To: <4C164F8D.60301@berkeley.edu> References: <4C164F8D.60301@berkeley.edu> Message-ID: > right now, I believe such boxes aren't available > yet. The closest thing is a 4-way 1U box, which > gives 48 cores per rack unit, but in *1 node*. well, the supermicro website lists them: http://www.supermicro.com/Aplus/system/2U/2022/AS-2022TG-HTRF.cfm http://www.supermicro.com/Aplus/system/2U/2022/AS-2022TG-HIBQRF.cfm > My intuition tells me that I should be wary of > such a configuration because of various SMP-related > locking and concurrency issues. why? is there something peculiar about your workload, and especially something that would show up with modestly higher SMPness? > There probably aren't > many single node 48 core boxes out there so there > might be surprises. I don't like surprises. this is hardly uncharted territory. SGI's been there forever, and some fringe boxes from Intel. but 8s 4c has been pretty mundane for a while, and doesn't need any sort of hand-holding. unless you mean something like "I expect to swap a lot and want to configure a single non-raid swap partition", I don't really see what you're worrying about... > The obvious thing to do would be to wait until > the Twin boxes come out but my problem is that > I have money to spend that has to be spent soon, > maybe before the Twin boxes come out. So, I'm trying > to decide what to do. (I only want 1U boxes because > I have to pay for rack space). I think people should actually take fresh look at 4s 1U boxes because AMD has eliminated the "4-socket penalty". there are some nontrivial advantages to fatter nodes - they let you achieve some unique workload configurations (bigger memory, higher-threaded, etc). sysadmin work doesn't scale linearly as the number of nodes, of course, but having fewer, fatter nodes can be attractive TCO-wise, too. regards, mark hahn. From hearnsj at googlemail.com Tue Jun 15 02:30:40 2010 From: hearnsj at googlemail.com (John Hearns) Date: Tue, 15 Jun 2010 10:30:40 +0100 Subject: [Beowulf] 48-Core X86_64 Compute Node - Good Idea? In-Reply-To: <4C166C08.8070600@berkeley.edu> References: <4C164F8D.60301@berkeley.edu> <4C166C08.8070600@berkeley.edu> Message-ID: On 14 June 2010 18:51, Jon Forrest wrote: >> > SGI isn't mainstream, and probably doesn't use the > same chipset and motherboards that SuperMicro > will be selling. > Take a close, close look inside those ICE blade enclosures. You are of course correct though - IA64 type Altixes and Ultraviolten are definitely not mainstream! From eugen at leitl.org Tue Jun 15 02:41:55 2010 From: eugen at leitl.org (Eugen Leitl) Date: Tue, 15 Jun 2010 11:41:55 +0200 Subject: [Beowulf] 10 U = 512x Atom Z530 + 2 GByte RAM In-Reply-To: <07FA933F-62AB-491C-80FF-C8C1D8906537@divination.biz> References: <20100614132209.GG1964@leitl.org> <07FA933F-62AB-491C-80FF-C8C1D8906537@divination.biz> Message-ID: <20100615094155.GU1964@leitl.org> On Mon, Jun 14, 2010 at 10:14:16PM -0400, Douglas J. Trainor wrote: > in Englisch: > > http://www.eweek.com/c/a/IT-Infrastructure/SeaMicro-Uses-Intel-Atom-Chip-in-Server-Architecture-745338/ > > Quote: "Feldman said the motherboard is shrunk from the size of a pizza box to that of a credit card." It's an interesting device. It would take >6 racks of those 300 EUR Supermicro Atom servers at a >200 kEUR price tag to match their 10 U. Though of course one could just use the 200 EUR motherboards on custom trays, and share the PSUs. Though one would then just stick with plain GBit Ethernet, and not their custom fabric. 
> douglas > > On Jun 14, 2010, at 9:22 AM, Eugen Leitl wrote: > > > > > http://www.heise.de/newsticker/meldung/Cloud-Server-mit-Intel-Atom-und-spaeter-auch-ARM-Prozessoren-1021400.html > > > > -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE From ttsiodras at gmail.com Sat Jun 12 18:39:33 2010 From: ttsiodras at gmail.com (Thanassis Tsiodras) Date: Sun, 13 Jun 2010 04:39:33 +0300 Subject: [Beowulf] OT: recoverable optical media archive format? Message-ID: I am the author of the updated "rsbep" package that was mentioned in this thread, and I was contacted by David Mathog (just "mathog" henceforth) about the issues he had. "mathog" reported a difference in the decoded output sizes, when directly using "rsbep" - but as mentioned by David N. Lombard, the usage scenario that "mathog" followed was not a sanctioned one: both my site's article as well as the package instructions (README) referred to the "freeze"/"melt" scripts, that decode the shielded data into the correct output size. The reason for what mathog experienced is a bit complex, but clearly explained on my site (http://users.softlab.ece.ntua.gr/~ttsiod/rsbep.html), and it boils down to this: in order to withstand storage errors, an interleaving of the Reed-Solomon encoded data has to take place. Basically, the x86 ASM code of Reed-Solomon that I "inherited" from the original "rsbep" and use in my package, adds 16 bytes of parity data to each block of 223 bytes of input, turning it to a 255-bytes block. These parity bytes allow detection and correction of 16 errors (in the encoded 255-byte block), as well as detection of 32 errors (in the encoded 255-byte block). This however won't work for storage media, since they work or fail on sector boundaries (512 bytes for disks and 2048 bytes for CDs/DVDs) - so the encoded data are interleaved by my package, inside blocks of 1040400 bytes (containing 4080 of the Reed-Solomon-encoded 255-byte blocks)... In this way, a loss of a sector only impacts ONE byte in the 512 encoded "blocks" that are passing through it (due to the interleaving)... If interested, you can read more details on my page, where I explain how the idea works. The end result, is that - the interleaved stream can lose 127 contiguous sectors (65024 contiguous bytes) and still be recoverable. - the interleaved stream can lose 128-255 sectors, and detect the error (and report it, but not fix it) - Beyond that number of errors (which correspond, after de-interleaving, to more than 32 bytes in the encoded 255 byte block), the Reed-Solomon code is lost... Given the interleaving that my package performs on the encoded bytes, the only chance of this happening, is losing a contiguous stream of more than 32x4080 bytes, i.e. 130560 bytes. A storage error that causes this much loss (255 contiguous sectors!) is a lost cause anyway - at least as far as my needs go. If you want to be able to recover from this or even larger amounts of loss, you can do it, by increasing the block size from my chosen 255x4080 (1040400 bytes) to something even bigger, and by adapting my interleaving code (rsbep.c, "distribute" function). To summarize, "mathog"'s pockmark app is not representative of what happens in storage media - they NEVER fail on byte-levels - they fail on sector levels. So what should you do, if you want to be 100% sure of failure detection? 
Simple: By reviewing my freeze/melt scripts, you will see that all I do
to the "to-be-shielded-stream" is (a) add a "magic marker" and (b) add
the file size, so that "melt.sh" can chop the output down to the right
size. If you want bullet-proof checks, you can easily add the MD5 or SHA
sum of the input data, to the "to-be-shielded-stream", so that the
"melting" process can check this and be 100% certain in restoration or
detection of failure, even in the face of impossible stream corruption
(more than 130K lost). Note however, that this is not necessary if you
use an algorithm that can detect errors in the decoded stream (which is
how I use my rsbep - i.e always on a stream generated by gzip, bzip2,
etc)

Hope this clarifies things.

Kind regards,
Thanassis Tsiodras, Dr.-Ing.

--
What I gave, I have; what I spent, I had; what I kept, I lost. -Old Epitaph
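A minimal shell sketch of that checksum idea follows. It assumes only the
pipe usage shown earlier in this thread (rsbep encodes stdin to stdout,
rsbep -d decodes); the shield/unshield names and the MAGIC header layout
are invented here for illustration and are not part of the rsbep package
or of its freeze/melt scripts.

#!/bin/bash
# Illustrative sketch only: prepend an MD5 sum and the original size to the
# stream before Reed-Solomon shielding, and verify both after decoding.
set -e

shield() {                      # shield <input> <shielded-output>
    local in="$1" out="$2" sum size
    sum=$(md5sum "$in" | awk '{print $1}')
    size=$(stat -c %s "$in")
    { printf 'MAGIC %s %s\n' "$sum" "$size"; cat "$in"; } | rsbep > "$out"
}

unshield() {                    # unshield <shielded-input> <output>
    local in="$1" out="$2" header sum size
    rsbep -d < "$in" > "$out.tmp"
    header=$(head -c 64 "$out.tmp" | head -n 1)   # "MAGIC <md5> <size>"
    sum=$(echo "$header" | awk '{print $2}')
    size=$(echo "$header" | awk '{print $3}')
    # skip the header line, then truncate the decoder's padding away
    tail -c +$(( ${#header} + 2 )) "$out.tmp" | head -c "$size" > "$out"
    rm -f "$out.tmp"
    echo "$sum  $out" | md5sum -c -   # fails loudly if restoration is incomplete
}

With a wrapper along these lines, a decode that silently truncates or pads
the payload shows up as an md5sum mismatch rather than as trailing garbage
after EOF.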
From ttsiodras at gmail.com  Sun Jun 13 13:16:35 2010
From: ttsiodras at gmail.com (Thanassis Tsiodras)
Date: Sun, 13 Jun 2010 23:16:35 +0300
Subject: [Beowulf] OT: recoverable optical media archive format?
Message-ID:

The corruption of rsbep-protected data with GCC versions later than 4.3
has now been fixed. I reviewed the code with valgrind, and it turned out
that the code I "inherited" from the original rsbep package performed
out of bounds read accesses in the "distribute" function... This was not
in the "core" Reed-solomon code, only in the interleaving function. I
re-wrote it - check rsbep.c, line 68 in the latest tarball. I also did a
clean-up of useless code that was never called.

The results work perfectly under all GCC versions I tried, under
Debian/32, Arch/32 and FreeBSD/64. Give it a try:

http://users.softlab.ntua.gr/~ttsiod/rsbep-0.1.0-ttsiodras.tar.bz2

So, to summarize, my response points on the original thread:

- The errors reported by David Mathog on the list had to do with
erroneous usage of the tools - either the "freeze"/"melt" scripts must
be used, or "chopping" of the output has to be done via your own custom
code.

- The memory read access errors were fixed, so the tool works fine with
all GCC versions under all OSes.

- David's "pockmark" app is not representative of what happens in
storage media - they don't fail on byte-boundaries - they fail on sector
boundaries. My small contributions (freeze/melt scripts) on rsbep make
sure that even if we lose 127 contiguous 512-byte sectors, we can still
recover the data, at the exact original size.

- If you want bullet-proof checks, you can easily add the MD5 or SHA sum
of the input data, to the "to-be-shielded-stream", so that the "melting"
can be 100% certain of successful restoration or of detecting failure.
This, however, is not necessary if you use an algorithm that can detect errors in the decoded stream (gzip, bzip2, etc) - One final comment about PAR, which was suggested by others: since it was designed for newsgroups, its recovery capabilities had other (non-storage related) scope - for an "executive summary" read the last of the comments I received when my rsbep became Slashdotted, here: http://hardware.slashdot.org/hardware/08/08/03/197254.shtml -- What I gave, I have; what I spent, I had; what I kept, I lost. -Old Epitaph From andreas.de-blanche at hv.se Mon Jun 14 00:32:18 2010 From: andreas.de-blanche at hv.se (Andreas de Blanche) Date: Mon, 14 Jun 2010 09:32:18 +0200 Subject: [Beowulf] Survey how industrial companies use their HPC Resources In-Reply-To: References: Message-ID: <4C15F722.7F5B.0055.1@hv.se> Dear all, I am sending out this survey on behalf of a master student of mine, he would very much appreciate if you took his survey. http://FreeOnlineSurveys.com/rendersurvey.asp?sid=xrimbjplz1tr5kd772404 Best Regards //Andreas de Blanche ******************************************************************* I'm Wei He, a master student who study Computer Science at University West in Sweden. I'm doing my thesis on investigation of how industrial companies use their High performance computing resources. I'm performing this study with my two supervisors Dr. Linn Christiernin and Andreas de Blanche. I know you have a wide range of experience in the field. It will take approximately 6-7 minutes to solve, please finish it before June 30th. Your help in completing this questionnaire is very much appreciated. The data collected is solely for research purpose. Thank you for participating in this questionnaire survey. If you have two Computing Resource please take this survey twice. http://FreeOnlineSurveys.com/rendersurvey.asp?sid=xrimbjplz1tr5kd772404 Yours Sincearly Wei He Master Student at University West, Sweden ******************************************************************* From deadline at eadline.org Tue Jun 15 08:31:57 2010 From: deadline at eadline.org (Douglas Eadline) Date: Tue, 15 Jun 2010 11:31:57 -0400 (EDT) Subject: [Beowulf] Survey how industrial companies use their HPC Resources In-Reply-To: <4C15F722.7F5B.0055.1@hv.se> References: <4C15F722.7F5B.0055.1@hv.se> Message-ID: <45740.192.168.1.213.1276615917.squirrel@mail.eadline.org> Will the results be publicly available? -- Doug > Dear all, > I am sending out this survey on behalf of a master student of mine, he > would very much appreciate if you took his survey. > http://FreeOnlineSurveys.com/rendersurvey.asp?sid=xrimbjplz1tr5kd772404 > > > Best Regards > //Andreas de Blanche > > ******************************************************************* > I'm Wei He, a master student who study Computer Science at University > West in Sweden. I'm doing my thesis on > investigation of how industrial companies use their High performance > computing resources. I'm performing this > study with my two supervisors Dr. Linn Christiernin and Andreas de > Blanche. I know you have a wide range of > experience in the field. > It will take approximately 6-7 minutes to solve, please finish it > before June 30th. Your help in completing this > questionnaire is very much appreciated. The data collected is solely > for research purpose. > Thank you for participating in this questionnaire survey. > If you have two Computing Resource please take this survey twice. 
> > http://FreeOnlineSurveys.com/rendersurvey.asp?sid=xrimbjplz1tr5kd772404 > > > > Yours Sincearly > Wei He > Master Student at University West, Sweden > ******************************************************************* > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > -- Doug -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From lindahl at pbm.com Tue Jun 15 13:12:26 2010 From: lindahl at pbm.com (Greg Lindahl) Date: Tue, 15 Jun 2010 13:12:26 -0700 Subject: [Beowulf] Survey how industrial companies use their HPC Resources In-Reply-To: <45740.192.168.1.213.1276615917.squirrel@mail.eadline.org> References: <4C15F722.7F5B.0055.1@hv.se> <45740.192.168.1.213.1276615917.squirrel@mail.eadline.org> Message-ID: <20100615201226.GF21791@bx9.net> On Tue, Jun 15, 2010 at 11:31:57AM -0400, Douglas Eadline wrote: > Will the results be publicly available? At the end of the survey, you can put in an email address to receive a copy of his finished thesis. I noticed that the country list doesn't include the USA. -- greg From buccaneer at rocketmail.com Wed Jun 16 17:08:16 2010 From: buccaneer at rocketmail.com (Buccaneer for Hire.) Date: Wed, 16 Jun 2010 17:08:16 -0700 (PDT) Subject: [Beowulf] Survey how industrial companies use their HPC Resources In-Reply-To: <20100615201226.GF21791@bx9.net> Message-ID: <814583.50541.qm@web30602.mail.mud.yahoo.com> --- On Tue, 6/15/10, Greg Lindahl wrote: > From: Greg Lindahl > > I noticed that the country list doesn't include the USA. > I just made the assumption if you left it blank that it was understood. :) From rigved.sharma123 at gmail.com Wed Jun 16 18:47:18 2010 From: rigved.sharma123 at gmail.com (rigved sharma) Date: Thu, 17 Jun 2010 07:17:18 +0530 Subject: [Beowulf] tracejob command shows error Message-ID: hi we are having cluster of 16 nodes and torque and maui installed on it. we have just migrated from torque 2.3.6 to torque 2.4.8.but tracejob command is not working. 
/usr/spool/PBS/server_priv/accounting/20100617: No matching job records located /usr/spool/PBS/server_logs/20100617: No matching job records located /usr/spool/PBS/mom_logs/20100617: No such file or directory /usr/spool/PBS/sched_logs/20100617: No such file or directory *** glibc detected *** tracejob: malloc(): memory corruption: 0x0000000003a74140 *** ======= Backtrace: ========= /lib64/libc.so.6[0x3845871cd1] /lib64/libc.so.6(__libc_malloc+0x7d)[0x3845872e8d] /lib64/libc.so.6(popen+0x23)[0x3845862a63] tracejob[0x401218] tracejob[0x401bcf] /lib64/libc.so.6(__libc_start_main+0xf4)[0x384581d8b4] tracejob[0x400e09] ======= Memory map: ======== 00400000-00403000 r-xp 00000000 68:05 6094878 /usr/spool/PBS/bin/tracejob 00603000-00604000 rw-p 00003000 68:05 6094878 /usr/spool/PBS/bin/tracejob 03a74000-03a95000 rw-p 03a74000 00:00 0 3844800000-384481a000 r-xp 00000000 68:02 896307 /lib64/ld-2.5.so 3844a1a000-3844a1b000 r--p 0001a000 68:02 896307 /lib64/ld-2.5.so 3844a1b000-3844a1c000 rw-p 0001b000 68:02 896307 /lib64/ld-2.5.so 3845800000-384594a000 r-xp 00000000 68:02 896308 /lib64/libc-2.5.so 384594a000-3845b49000 ---p 0014a000 68:02 896308 /lib64/libc-2.5.so 3845b49000-3845b4d000 r--p 00149000 68:02 896308 /lib64/libc-2.5.so 3845b4d000-3845b4e000 rw-p 0014d000 68:02 896308 /lib64/libc-2.5.so 3845b4e000-3845b53000 rw-p 3845b4e000 00:00 0 3849c00000-3849c0d000 r-xp 00000000 68:02 896314 /lib64/libgcc_s-4.1.2-20080102.so.1 3849c0d000-3849e0d000 ---p 0000d000 68:02 896314 /lib64/libgcc_s-4.1.2-20080102.so.1 3849e0d000-3849e0e000 rw-p 0000d000 68:02 896314 /lib64/libgcc_s-4.1.2-20080102.so.1 2abb01fde000-2abb01fe0000 rw-p 2abb01fde000 00:00 0 2abb01fe0000-2abb02009000 r-xp 00000000 68:05 6072097 /usr/spool/PBS/lib/libtorque.so.2.0.0 2abb02009000-2abb02209000 ---p 00029000 68:05 6072097 /usr/spool/PBS/lib/libtorque.so.2.0.0 2abb02209000-2abb0220b000 rw-p 00029000 68:05 6072097 /usr/spool/PBS/lib/libtorque.so.2.0.0 2abb0220b000-2abb022ee000 rw-p 2abb0220b000 00:00 0 2abb0231a000-2abb0231b000 rw-p 2abb0231a000 00:00 0 2abb04000000-2abb04021000 rw-p 2abb04000000 00:00 0 2abb04021000-2abb08000000 ---p 2abb04021000 00:00 0 7fffa8ab6000-7fffa8acc000 rw-p 7fffa8ab6000 00:00 0 [stack] ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0 [vdso] Aborted. kindly suggest. -------------- next part -------------- An HTML attachment was scrubbed... URL: From samuel at unimelb.edu.au Wed Jun 16 22:05:33 2010 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Thu, 17 Jun 2010 15:05:33 +1000 Subject: [Beowulf] tracejob command shows error In-Reply-To: References: Message-ID: On Thu, 17 Jun 2010 11:47:18 am rigved sharma wrote: > *** glibc detected *** tracejob: malloc(): memory corruption: > 0x0000000003a74140 *** There was a bug reported that looked similar (though 32-bit) here: http://www.clusterresources.com/bugzilla/show_bug.cgi?id=49 Can I ask you to get a bugzilla account and add your problem to that one please ? The original reporter got rid of his install before we could track the problem down. For the record we're running 2.4.8 at VLSCI on CentOS 5.4 (x86-64) without any issues. cheers, Chris -- Christopher Samuel Senior Systems Administrator VLSCI - Victorian Life Sciences Computational Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From rchang.lists at gmail.com Thu Jun 17 09:31:18 2010 From: rchang.lists at gmail.com (Richard Chang) Date: Thu, 17 Jun 2010 22:01:18 +0530 Subject: [Beowulf] TEST - Pls ignore Message-ID: <4C1A4DD6.1030905@gmail.com> TEST From Craig.Tierney at noaa.gov Thu Jun 17 13:59:13 2010 From: Craig.Tierney at noaa.gov (Craig Tierney) Date: Thu, 17 Jun 2010 14:59:13 -0600 Subject: [Beowulf] Looking for block size settings (from stat) on parallel filesystems In-Reply-To: <4C1A4DD6.1030905@gmail.com> References: <4C1A4DD6.1030905@gmail.com> Message-ID: <592EB061-9728-4C70-B04F-96FA81DF9CD3@noaa.gov> I am looking for a little help to find out what block sizes (as shown by stat) by Linux based parallel filesystems. You can find this by running stat on a file. For example on Lustre: # stat /lfs0/bigfile File: `/lfs0//bigfile' Size: 1073741824 Blocks: 2097160 IO Block: 2097152 regular file Device: 59924a4a8h/1502839976d Inode: 45361266 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2010-06-17 20:24:32.000000000 +0000 Modify: 2010-06-17 20:16:49.000000000 +0000 Change: 2010-06-17 20:16:49.000000000 +0000 If anyone can run this test and provide me with the filesystem and result (as well as the OS used), it would be a big help. I am specifically looking for GPFS results, but other products (Panasas, GlusterFS, NetApp GX) would be helpful. Why do I care? Because in netcdf, when nf_open or nf_create are called, it will use the blocksize that is found in the stat structure. On lustre it is 2M so writes are very fast. However, if the number comes back as 4k (which some filesystems do), then writes are slower than they need to be. This isn't just a netcdf issue. The Linux tool cp does the same thing, it will use a block size that matches the specified blocksize of the destination filesystem. Thanks, Craig From jlb17 at duke.edu Thu Jun 17 14:35:47 2010 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Thu, 17 Jun 2010 17:35:47 -0400 (EDT) Subject: [Beowulf] Looking for block size settings (from stat) on parallel filesystems In-Reply-To: <592EB061-9728-4C70-B04F-96FA81DF9CD3@noaa.gov> References: <4C1A4DD6.1030905@gmail.com> <592EB061-9728-4C70-B04F-96FA81DF9CD3@noaa.gov> Message-ID: On Thu, 17 Jun 2010 at 2:59pm, Craig Tierney wrote > I am looking for a little help to find out what block sizes (as shown > by stat) by Linux based parallel filesystems. > > You can find this by running stat on a file. For example on Lustre: > > # stat /lfs0/bigfile > File: `/lfs0//bigfile' > Size: 1073741824 Blocks: 2097160 IO Block: 2097152 regular file > Device: 59924a4a8h/1502839976d Inode: 45361266 Links: 1 > Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) > Access: 2010-06-17 20:24:32.000000000 +0000 > Modify: 2010-06-17 20:16:49.000000000 +0000 > Change: 2010-06-17 20:16:49.000000000 +0000 > > If anyone can run this test and provide me with the filesystem > and result (as well as the OS used), it would be a big help. I am > specifically looking for GPFS results, but other products (Panasas, > GlusterFS, NetApp GX) would be helpful. 
GlusterFS 3.0.4 on CentOS-5: stat pdball.pir File: `pdball.pir' Size: 155471981 Blocks: 303984 IO Block: 4096 regular file Device: 21h/33d Inode: 205792080 Links: 1 Access: (0644/-rw-r--r--) Uid: (11805/database) Gid: (11805/database) Access: 2010-06-10 16:55:43.000000000 -0700 Modify: 2010-06-10 06:03:53.000000000 -0700 Change: 2010-06-10 23:36:14.000000000 -0700 -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF From samuel at unimelb.edu.au Thu Jun 17 16:58:50 2010 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Fri, 18 Jun 2010 09:58:50 +1000 Subject: [Beowulf] Looking for block size settings (from stat) on parallel filesystems In-Reply-To: <592EB061-9728-4C70-B04F-96FA81DF9CD3@noaa.gov> References: <4C1A4DD6.1030905@gmail.com> <592EB061-9728-4C70-B04F-96FA81DF9CD3@noaa.gov> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 18/06/10 06:59, Craig Tierney wrote: > If anyone can run this test and provide me with the > filesystem and result (as well as the OS used), it > would be a big help. I am specifically looking for > GPFS results, but other products (Panasas, GlusterFS, > NetApp GX) would be helpful. Our Panasas system says: File: `hwloc-1.0.1rc1.tar.bz2' Size: 1855126 Blocks: 4256 IO Block: 4096 regular file Device: 18h/24d Inode: 283485546 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 500/ samuel) Gid: ( 506/ vlsci) Access: 2010-06-03 11:29:54.613258023 +1000 Modify: 2010-05-29 01:24:56.000000000 +1000 Change: 2010-06-03 11:29:54.613258023 +1000 cheers, Chris - -- Christopher Samuel - Senior Systems Administrator VLSCI - Victorian Life Sciences Computational Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.unimelb.edu.au/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkwatroACgkQO2KABBYQAh+YCgCeIPo18p9HtjAVn9O5R89Xhm9b dRIAnROHlQj3WEeM6AbrSZ0fmiej7vBv =flxS -----END PGP SIGNATURE----- -------------- next part -------------- An HTML attachment was scrubbed... URL: From lindahl at pbm.com Thu Jun 17 18:29:44 2010 From: lindahl at pbm.com (Greg Lindahl) Date: Thu, 17 Jun 2010 18:29:44 -0700 Subject: [Beowulf] Looking for block size settings (from stat) on parallel filesystems In-Reply-To: <592EB061-9728-4C70-B04F-96FA81DF9CD3@noaa.gov> References: <4C1A4DD6.1030905@gmail.com> <592EB061-9728-4C70-B04F-96FA81DF9CD3@noaa.gov> Message-ID: <20100618012944.GD32585@bx9.net> On Thu, Jun 17, 2010 at 02:59:13PM -0600, Craig Tierney wrote: > Why do I care? Because in netcdf, when nf_open or nf_create are > called, it will use the blocksize that is found in the stat structure. On > lustre it is 2M so writes are very fast. However, if the number comes > back as 4k (which some filesystems do), then writes are slower than > they need to be. This isn't just a netcdf issue. The Linux tool cp does > the same thing, it will use a block size that matches the specified > blocksize of the destination filesystem. Craig, On-node filesystems merge writes in the guts of the block device system, so I wouldn't be surprised if 4k buffers and 2M buffers were about the same with ext3. To get an idea if this is the case with parallel filesystems, if people could measure the speed of dd with various blocksizes, that would tell you a better answer than just the blocksize. But, of course, you will run into the usual issue of write buffering. 
-- greg From kilian.cavalotti.work at gmail.com Fri Jun 18 01:08:36 2010 From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI) Date: Fri, 18 Jun 2010 10:08:36 +0200 Subject: [Beowulf] Looking for block size settings (from stat) on parallel filesystems In-Reply-To: <592EB061-9728-4C70-B04F-96FA81DF9CD3@noaa.gov> References: <4C1A4DD6.1030905@gmail.com> <592EB061-9728-4C70-B04F-96FA81DF9CD3@noaa.gov> Message-ID: Craig, On Thu, Jun 17, 2010 at 10:59 PM, Craig Tierney wrote: > If anyone can run this test and provide me with the filesystem > and result (as well as the OS used), it would be a big help. ?I am > specifically looking for GPFS results, but other products (Panasas, > GlusterFS, NetApp GX) would be helpful. GPFS 3.3 on RHEL5.5: # stat /gpfs/bigfile File: `/gpfs/bigfile' Size: 10737418240 Blocks: 20971520 IO Block: 1048576 regular file Device: 13h/19d Inode: 127073 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2010-06-18 10:04:10.555613000 +0200 Modify: 2010-06-18 10:04:58.764714000 +0200 Change: 2010-06-18 10:04:58.764714000 +0200 But block size is really something you can choose when creating the filesystem (mmcrfs -B). Cheers, -- Kilian From john.hearns at mclaren.com Fri Jun 18 04:30:12 2010 From: john.hearns at mclaren.com (Hearns, John) Date: Fri, 18 Jun 2010 12:30:12 +0100 Subject: [Beowulf] Turboboost/IDA on Nehalem Message-ID: <68A57CCFD4005646957BD2D18E60667B10CDC0DA@milexchmb1.mil.tagmclarengroup.com> Does anyone know much about Turboboost on Nehalem? I would like to have some indication that this is working, and perhaps measure what effect it has. I have enabled Turboboost in the BIOS, however when I modprobe acpi_cpufreq I get FATAL: Error inserting acpi_cpufreq (/lib/modules/2.6.31.12-0.2-desktop/kernel/arch/x86/kernel/cpu/cpufreq/a cpi-cpufreq.ko): No such device Dave Jones' blog suggest this is a BIOS issue. Any ideas? I have tried BIOS flags for hardware and software control of CPU stepping. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From douglas.guptill at dal.ca Fri Jun 18 05:27:20 2010 From: douglas.guptill at dal.ca (Douglas Guptill) Date: Fri, 18 Jun 2010 09:27:20 -0300 Subject: [Beowulf] Turboboost/IDA on Nehalem In-Reply-To: <68A57CCFD4005646957BD2D18E60667B10CDC0DA@milexchmb1.mil.tagmclarengroup.com> References: <68A57CCFD4005646957BD2D18E60667B10CDC0DA@milexchmb1.mil.tagmclarengroup.com> Message-ID: <20100618122720.GB1373@sopalepc> On Fri, Jun 18, 2010 at 12:30:12PM +0100, Hearns, John wrote: > Does anyone know much about Turboboost on Nehalem? > I would like to have some indication that this is working, and perhaps > measure what effect it has. > I have enabled Turboboost in the BIOS, however when I modprobe > acpi_cpufreq I get > > FATAL: Error inserting acpi_cpufreq > (/lib/modules/2.6.31.12-0.2-desktop/kernel/arch/x86/kernel/cpu/cpufreq/a > cpi-cpufreq.ko): No such device > > > Dave Jones' blog suggest this is a BIOS issue. Any ideas? > I have tried BIOS flags for hardware and software control of CPU > stepping. 
Here is what I get: scrum:~# uname -a Linux scrum 2.6.26-2-amd64 #1 SMP Wed May 12 18:03:14 UTC 2010 x86_64 GNU/Linux scrum:~# modprobe -nv acpi_cpufreq insmod /lib/modules/2.6.26-2-amd64/kernel/drivers/cpufreq/freq_table.ko insmod /lib/modules/2.6.26-2-amd64/kernel/arch/x86/kernel/cpu/cpufreq/acpi-cpu scrum:~# modprobe acpi_cpufreq scrum:~# Which seems to have worked. HTH, Douglas. P.S. `less /proc/cpuinfo` starts with: processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 26 model name : Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz stepping : 5 cpu MHz : 2668.000 cache size : 8192 KB physical id : 0 siblings : 8 core id : 0 cpu cores : 4 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm ida bogomips : 5349.70 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: -- Douglas Guptill voice: 902-461-9749 Research Assistant, LSC 4640 email: douglas.guptill at dal.ca Oceanography Department fax: 902-494-3877 Dalhousie University Halifax, NS, B3H 4J1, Canada From cap at nsc.liu.se Fri Jun 18 07:41:22 2010 From: cap at nsc.liu.se (Peter Kjellstrom) Date: Fri, 18 Jun 2010 16:41:22 +0200 Subject: [Beowulf] Turboboost/IDA on Nehalem In-Reply-To: <68A57CCFD4005646957BD2D18E60667B10CDC0DA@milexchmb1.mil.tagmclarengroup.com> References: <68A57CCFD4005646957BD2D18E60667B10CDC0DA@milexchmb1.mil.tagmclarengroup.com> Message-ID: <201006181641.26945.cap@nsc.liu.se> On Friday 18 June 2010, Hearns, John wrote: > Does anyone know much about Turboboost on Nehalem? > I would like to have some indication that this is working, and perhaps > measure what effect it has. > I have enabled Turboboost in the BIOS, however when I modprobe > acpi_cpufreq I get > > FATAL: Error inserting acpi_cpufreq > (/lib/modules/2.6.31.12-0.2-desktop/kernel/arch/x86/kernel/cpu/cpufreq/a > cpi-cpufreq.ko): No such device On stock CentOS-5 it looks like this (with dual E5520): [root at m1 ~]# uname -r 2.6.18-194.3.1.el5 [root at m1 ~]# /etc/init.d/cpuspeed start Enabling ondemand cpu frequency scaling: [ OK ] [root at m1 ~]# lsmod | grep acpi_cpufreq acpi_cpufreq 47937 0 [root at m1 ~]# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies 2268000 2267000 2133000 2000000 1867000 1733000 1600000 Turbo-boost is the +1MHz freq 2268000. When the governor goes looking for the highest available frequency the CPU goes into turbo-mode. The actual frequency then is determined by available power and thermal margins but ultimately limited depending on processor model (my E5520 can do one freq. step, that is, 2.26 -> 2.4 GHz. X55xx can do two). In my experience (E5520 and X5550) turbo mode works and gives you extra performance even when using all cores (HPC load). However, power consumption typically goes up a lot. /Peter -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part. 
URL: From eagles051387 at gmail.com Fri Jun 18 22:32:29 2010 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Sat, 19 Jun 2010 07:32:29 +0200 Subject: [Beowulf] Turboboost/IDA on Nehalem In-Reply-To: <201006181641.26945.cap@nsc.liu.se> References: <68A57CCFD4005646957BD2D18E60667B10CDC0DA@milexchmb1.mil.tagmclarengroup.com> <201006181641.26945.cap@nsc.liu.se> Message-ID: if im not mistaken the increase is about 600mHz regardless of the i7 model. feel free to correct me if im wrong -------------- next part -------------- An HTML attachment was scrubbed... URL: From rpnabar at gmail.com Sat Jun 19 08:24:13 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Sat, 19 Jun 2010 10:24:13 -0500 Subject: [Beowulf] Turboboost/IDA on Nehalem In-Reply-To: <20100618122720.GB1373@sopalepc> References: <68A57CCFD4005646957BD2D18E60667B10CDC0DA@milexchmb1.mil.tagmclarengroup.com> <20100618122720.GB1373@sopalepc> Message-ID: On Fri, Jun 18, 2010 at 7:27 AM, Douglas Guptill wrote: > On Fri, Jun 18, 2010 at 12:30:12PM +0100, Hearns, John wrote: >> Does anyone know much about Turboboost on Nehalem? >> I would like to have some indication that this is working, and perhaps >> measure what effect it has. >> I have enabled Turboboost in the BIOS, however when I modprobe >> acpi_cpufreq I get What's a good way to confirm if my procs are actually in a turbo state at a given point of time. It doesn't get reported back through the usual BIOS channels does it? -- Rahul From hahn at mcmaster.ca Sun Jun 20 13:43:42 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Sun, 20 Jun 2010 16:43:42 -0400 (EDT) Subject: [Beowulf] Turboboost/IDA on Nehalem In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B10CDC0DA@milexchmb1.mil.tagmclarengroup.com> <201006181641.26945.cap@nsc.liu.se> Message-ID: > if im not mistaken the increase is about 600mHz regardless of the i7 model. > feel free to correct me if im wrong no, it varies by model. From eagles051387 at gmail.com Sun Jun 20 14:02:48 2010 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Sun, 20 Jun 2010 23:02:48 +0200 Subject: [Beowulf] Turboboost/IDA on Nehalem In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B10CDC0DA@milexchmb1.mil.tagmclarengroup.com> <201006181641.26945.cap@nsc.liu.se> Message-ID: whats the range that it varies by? -------------- next part -------------- An HTML attachment was scrubbed... URL: From hahn at mcmaster.ca Sun Jun 20 14:09:16 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Sun, 20 Jun 2010 17:09:16 -0400 (EDT) Subject: [Beowulf] Turboboost/IDA on Nehalem In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B10CDC0DA@milexchmb1.mil.tagmclarengroup.com> <201006181641.26945.cap@nsc.liu.se> Message-ID: > whats the range that it varies by? http://ark.intel.com/MySearch.aspx?TBT=true but seriously, you could have found this with 3 clicks or so. From jamesb at loreland.org Mon Jun 21 02:37:43 2010 From: jamesb at loreland.org (James Braid) Date: Mon, 21 Jun 2010 10:37:43 +0100 Subject: [Beowulf] Turboboost/IDA on Nehalem In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B10CDC0DA@milexchmb1.mil.tagmclarengroup.com> <20100618122720.GB1373@sopalepc> Message-ID: On Sat, Jun 19, 2010 at 16:24, Rahul Nabar wrote: > On Fri, Jun 18, 2010 at 7:27 AM, Douglas Guptill wrote: >> On Fri, Jun 18, 2010 at 12:30:12PM +0100, Hearns, John wrote: >>> Does anyone know much about Turboboost on Nehalem? >>> I would like to have some indication that this is working, and perhaps >>> measure what effect it has. 
>>> I have enabled Turboboost in the BIOS, however when I modprobe >>> acpi_cpufreq I get > > What's a good way to confirm if ?my procs are actually in a turbo > state at a given point of time. It doesn't get reported back through > the usual BIOS channels does it? turbostat from pmtools is a great little tool for seeing this type of info. http://kernel.org/pub/linux/kernel/people/lenb/acpi/utils/ FWIW acpi_cpufreq doesn't load on any of our Nehalem based systems (HP, Dell, running on Fedora 12) - but turbo boost and all the other power efficiency features work fine as far as I can tell. From bs_lists at aakef.fastmail.fm Thu Jun 17 16:08:35 2010 From: bs_lists at aakef.fastmail.fm (Bernd Schubert) Date: Fri, 18 Jun 2010 01:08:35 +0200 Subject: [Beowulf] Looking for block size settings (from stat) on parallel filesystems In-Reply-To: <592EB061-9728-4C70-B04F-96FA81DF9CD3@noaa.gov> References: <4C1A4DD6.1030905@gmail.com> <592EB061-9728-4C70-B04F-96FA81DF9CD3@noaa.gov> Message-ID: <201006180108.35316.bs_lists@aakef.fastmail.fm> On Thursday 17 June 2010, Craig Tierney wrote: > I am looking for a little help to find out what block sizes (as shown > by stat) by Linux based parallel filesystems. > > You can find this by running stat on a file. For example on Lustre: > > # stat /lfs0/bigfile > File: `/lfs0//bigfile' > Size: 1073741824 Blocks: 2097160 IO Block: 2097152 regular file > Device: 59924a4a8h/1502839976d Inode: 45361266 Links: 1 > Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) > Access: 2010-06-17 20:24:32.000000000 +0000 > Modify: 2010-06-17 20:16:49.000000000 +0000 > Change: 2010-06-17 20:16:49.000000000 +0000 > > If anyone can run this test and provide me with the filesystem > and result (as well as the OS used), it would be a big help. I am > specifically looking for GPFS results, but other products (Panasas, > GlusterFS, NetApp GX) would be helpful. > > Why do I care? Because in netcdf, when nf_open or nf_create are > called, it will use the blocksize that is found in the stat structure. On > lustre it is 2M so writes are very fast. However, if the number comes > back as 4k (which some filesystems do), then writes are slower than > they need to be. This isn't just a netcdf issue. The Linux tool cp does > the same thing, it will use a block size that matches the specified > blocksize of the destination filesystem. Probably a bit hackish, but it would be very simple to write an overlay fuse filesystem, which would allow to modify that parameter. Unfortunately, we also would need to modify fuse, as current maximum through fuse are 128KB. Although it also would be easy to change those defines. However, I'm not sure if RedHat backported those patches to allow large IO sizes through fuse at all. If not, glusterfs on RedHat also only will send 4KB requests. Cheers, Bernd From robl at mcs.anl.gov Fri Jun 18 13:29:37 2010 From: robl at mcs.anl.gov (Rob Latham) Date: Fri, 18 Jun 2010 15:29:37 -0500 Subject: [Beowulf] [hpc-announce] Deadline extended: CFP: Workshop on Interfaces and Abstractions for Scientific Data Storage (IASDS 2010) Message-ID: <20100618202937.GJ4073@mcs.anl.gov> We have extended the deadline for submission to IASDS 2010 by one week CALL FOR PAPERS: IASDS 2010 (http://www.mcs.anl.gov/events/workshops/iasds10/) In conjunction with IEEE Cluster 2010 (http://www.cluster2010.org/) High-performance computing simulations and large scientific experiments such as those in high energy physics generate tens of terabytes of data, and these data sizes grow each year. 
Existing systems for storing, managing, and analyzing data are being pushed to their limits by these applications, and new techniques are necessary to enable efficient data processing for future simulations and experiments. This workshop will provide a forum for engineers and scientists to present and discuss their most recent work related to the storage, management, and analysis of data for scientific workloads. Emphasis will be placed on forward-looking approaches to tackle the challenges of storage at extreme scale or to provide better abstractions for use in scientific workloads. TOPICS OF INTEREST: Topics of interest include, but are not limited to: - parallel file systems - scientific databases - active storage - scientific I/O middleware - extreme scale storage PAPER SUBMISSION Workshop papers will be peer-reviewed and will appear as part of the IEEE Cluster 2010 proceedings. Submissions must follow the Cluster 2010 format: PDF files only. Maximum 10 pages. Single-spaced 8.5x11-inch, Two-column numbered pages in IEEE Xplore format IMPORTANT DATES: Paper Submission Deadline: now June 28, 2010 Author Notification: July 16, 2010 Final Manuscript: July 30, 2010 Workshop: September 24, 2010 PROGRAM COMMITTEE: Program Committee Robert Latham, Argonne National Laboratory Quincey Koziol, The HDF Group Pete Wyckoff, Netapp Wei-Keng Liao, Northwestern University Florin Isalia, Universidad Carlos III de Madrid Katie Antypas, NERSC Anshu Dubey, FLASH Dean Hildebrand, IBM Almaden Bradley Settlemyer, Oak Ridge National Laboratory -- Rob Latham Mathematics and Computer Science Division Argonne National Lab, IL USA From turnerg at indiana.edu Sat Jun 19 09:37:48 2010 From: turnerg at indiana.edu (George Wm Turner) Date: Sat, 19 Jun 2010 12:37:48 -0400 Subject: [Beowulf] Turboboost/IDA on Nehalem In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B10CDC0DA@milexchmb1.mil.tagmclarengroup.com> <20100618122720.GB1373@sopalepc> Message-ID: <365A08AC-E47F-4F10-A5A1-D9E5BA9AB5E0@indiana.edu> cat /proc/cpuinfo Look for a clock rate higher then the chip's rated clock speed. You must have cpuspeed enabled (re: redhat: service cpuspeed start) On multi core chips, turbomode comes into play when the chip is lighty loaded and the idle cores can be clocked down and that power divereted to the core(s) actually running code. On an idle system, you may notice that all the cpus in /proc/cpuinfo" say they're running at the higher clock speeds; it's an illusion; they ain't doin' nuttin. george wm turner high performance systems 812 855 5156 On Jun 19, 2010, at 11:24 AM, Rahul Nabar wrote: > On Fri, Jun 18, 2010 at 7:27 AM, Douglas Guptill > wrote: >> On Fri, Jun 18, 2010 at 12:30:12PM +0100, Hearns, John wrote: >>> Does anyone know much about Turboboost on Nehalem? >>> I would like to have some indication that this is working, and >>> perhaps >>> measure what effect it has. >>> I have enabled Turboboost in the BIOS, however when I modprobe >>> acpi_cpufreq I get > > What's a good way to confirm if my procs are actually in a turbo > state at a given point of time. It doesn't get reported back through > the usual BIOS channels does it? 
> > -- > Rahul > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From K.Weiss at science-computing.de Mon Jun 21 04:00:16 2010 From: K.Weiss at science-computing.de (Karsten Weiss) Date: Mon, 21 Jun 2010 13:00:16 +0200 (CEST) Subject: [Beowulf] Turboboost/IDA on Nehalem In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B10CDC0DA@milexchmb1.mil.tagmclarengroup.com> <20100618122720.GB1373@sopalepc> Message-ID: On Mon, 21 Jun 2010, James Braid wrote: > > What's a good way to confirm if ?my procs are actually in a turbo > > state at a given point of time. It doesn't get reported back through > > the usual BIOS channels does it? > > turbostat from pmtools is a great little tool for seeing this type of info. > > http://kernel.org/pub/linux/kernel/people/lenb/acpi/utils/ > > FWIW acpi_cpufreq doesn't load on any of our Nehalem based systems > (HP, Dell, running on Fedora 12) - but turbo boost and all the other > power efficiency features work fine as far as I can tell. Another alternative is i7z: http://code.google.com/p/i7z/ -- Karsten Weiss / science + computing ag From prentice at ias.edu Fri Jun 25 07:28:15 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Fri, 25 Jun 2010 10:28:15 -0400 Subject: [Beowulf] Peformance penalty when using 128-bit reals on AMD64 Message-ID: <4C24BCFF.1040007@ias.edu> Beowulfers, One of my Fortran programmers had to increase the precision of his program so he switched from REAL*8 to REAL*16 which changes the size of his variables from 64 bits to 128 bits. The program now takes 32x longer to run. I'm not an expert on processor archtitecture, etc., but I do know that once the size of a variable exceeds the size of the processors registers, things will slow down considerably. Is his 32x performance degradation in line with this? Is there any way to reduce this degradation? Would The GNU GMP library (or some other library) help speed things up? -- Prentice From landman at scalableinformatics.com Fri Jun 25 07:51:52 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Fri, 25 Jun 2010 10:51:52 -0400 Subject: [Beowulf] Peformance penalty when using 128-bit reals on AMD64 In-Reply-To: <4C24BCFF.1040007@ias.edu> References: <4C24BCFF.1040007@ias.edu> Message-ID: <4C24C288.5090604@scalableinformatics.com> Prentice Bisbal wrote: > Beowulfers, > > One of my Fortran programmers had to increase the precision of his > program so he switched from REAL*8 to REAL*16 which changes the size of > his variables from 64 bits to 128 bits. The program now takes 32x longer > to run. > > I'm not an expert on processor archtitecture, etc., but I do know that > once the size of a variable exceeds the size of the processors > registers, things will slow down considerably. Is his 32x performance > degradation in line with this? At least 4x more work is often the case. 32x doesn't sound unreasonable. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. 
email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From prentice at ias.edu Fri Jun 25 07:53:11 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Fri, 25 Jun 2010 10:53:11 -0400 Subject: [Beowulf] Thread hijacking In-Reply-To: <4B352F20504D274FBEB616B2247487F7C25C15@exchange2003.highdata.com> References: <4C24BCFF.1040007@ias.edu> <4B352F20504D274FBEB616B2247487F7C25C15@exchange2003.highdata.com> Message-ID: <4C24C2D7.20003@ias.edu> Jaime, If you're going to post something like this to our mailing list, please do not reply to someone else's email on a different topic. This is known as thread hijacking, and screws up the flow of the conversation, especially in the mailing list archives. It's bad etiquette. Since this position is for someone with cluster skills, it is appropriate to post it here - as it's own thread. Prentice Jaime Biggs wrote: > Hello, > > I was wondering if anyone had any interest in this opportunity? If not > we can pay you $1,000 if we can place the person you refer to me. > > > > Location: Cambridge, MA > Duration: 3-6 months possible long term > Senior Support Analyst and Beowulf Cluster Administrator > > From Manager: > > This is the profile I need. Will be working on a Centrfy project (synchs > unix accounts with Active Directory), completing the set up of the new > linux cluster using Cluster File Services for connection to the storage > appliance and Sun Grid Engine as the cluster management application, and > creating a VM server and installing PipelinePilot for one research > group. > > KEYS: Worked with USERS, Managed a Cluster (Linux-Sungard), and Linux. > Centrfy project (synchs unix accounts with Active Directory), completing > the set up of the new linux cluster using Cluster File Services for > connection to the storage appliance and Sun Grid Engine as the cluster > management application, and creating a VM server and installing > PipelinePilot for one research group. > > *5 to 10 years in system administration and application support in a > Unix/Linux environment > *3 to 5 years experience in Shell scripting (tcsh and bash), TCP/IP, > NFS, CIFS (SMB), PAM, Apache, JBoss, DHCP and SystemImager. > *2 to 3 years experience managing a high availability Linux cluster; > experience with Cluster File Systems, Beowulf parallel clusters, Moab, > Torque/PBS, MPI and OS upgrades or equivalent products/environments > *Working knowledge of Perl, network information directories (like NIS, > Active Directory, LDAP, K), PHP, Java, gnome, KDE > *Experience in configuring Beowulf clusters using Sun Grid Engine > > > Best Regards, > > Jaime Biggs > Director of Recruitment > Ivesia Solutions, Inc. > 2 Keewaydin Dr. > Salem, NH 03079 > tel: 800-871-1510 x 2407 > tel: 603-685-2407 > fax: 603-890-1276 > jbiggs at ivesia.com > www.ivesia.com > > Jaime Biggs > Director of Recruitment > Ivesia Solutions, Inc. > 2 Keewaydin Dr. 
> Salem, NH 03079 > tel: 800-871-1510 x 2407 > tel: 603-685-2407 > fax: 603-890-1276 > jbiggs at ivesia.com > www.ivesia.com > > > -----Original Message----- > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] > On Behalf Of Prentice Bisbal > Sent: Friday, June 25, 2010 10:28 AM > To: Beowulf Mailing List > Subject: [Beowulf] Peformance penalty when using 128-bit reals on AMD64 > > Beowulfers, > > One of my Fortran programmers had to increase the precision of his > program so he switched from REAL*8 to REAL*16 which changes the size of > his variables from 64 bits to 128 bits. The program now takes 32x longer > to run. > > I'm not an expert on processor archtitecture, etc., but I do know that > once the size of a variable exceeds the size of the processors > registers, things will slow down considerably. Is his 32x performance > degradation in line with this? > > Is there any way to reduce this degradation? Would The GNU GMP library > (or some other library) help speed things up? > > -- Prentice Bisbal Linux Software Support Specialist/System Administrator School of Natural Sciences Institute for Advanced Study Princeton, NJ From ljdursi at scinet.utoronto.ca Fri Jun 25 08:13:20 2010 From: ljdursi at scinet.utoronto.ca (Jonathan Dursi) Date: Fri, 25 Jun 2010 11:13:20 -0400 Subject: [Beowulf] Peformance penalty when using 128-bit reals on AMD64 In-Reply-To: <4C24BCFF.1040007@ias.edu> References: <4C24BCFF.1040007@ias.edu> Message-ID: <28872C7E-BD70-4E68-906E-6E3A3342D98B@scinet.utoronto.ca> I can certainly imagine 2-8x slowdown; 4x for say multiplication (I believe AMD64 doesn't support quad-precision in hardware, so everything has to be emulated) and 2x for the extra memory bandwidth. 32x seems harsh, but isn't obviously crazy. This sounds a lot like blindly using a sledgehammer, though. If the user absolutely requires quad-precision everywhere because they need precision everywhere in their calculation better than one part in 1e16, then they're basically just doomed; but there are very few applications in that regime. Likely there's some part of their problem which is particularly sensitive to the numerics (or they're just using crappy numerics everywhere). One nice thing about the flurry of GPGPU activity is that it's inspired a resurgence of interest in `mixed precision algorithms', where parts of the numerics are implemented (or emulated) at very high precision, and others are implemented at lower precision. It might be worth googling around a bit for their particular problem to see if people have implemented that sort of approach for their particular problem. Of course if they really really need quad precision they should find an architecture (the Power series) that supports quad precision in hardware; but they'll always end up having to pay the 2x memory bandwidth penalty, no way around that. The Gnu GMP, which is very cute and well implemented, is definitely not a way to make things go *faster*. It may well be faster than the other arbitrary-precision libraries out there, but I would expect it to be slower than (fixed) quad precision. On the other hand, if there's only a small portion of the code that needs that approach and the rest can be done in double, there may not be a huge speed penalty. 
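A quick way to see how much of such a slowdown comes from the emulated
arithmetic alone is to time the same toy kernel at both precisions. The
sketch below is only illustrative: it assumes the Intel compiler (ifort),
where the kind numbers 8 and 16 match the byte sizes and REAL(16) is done
in software; with another compiler the kinds should come from
selected_real_kind() rather than being hard-coded.

#!/bin/bash
# Rough timing of one toy kernel built with REAL*8 and then REAL*16.
cat > kernel.f90 <<'EOF'
program kernel
  implicit none
  integer, parameter :: wp = WPKIND
  real(wp) :: x, s
  integer :: i
  s = 0.0_wp
  x = 1.0000000001_wp
  do i = 1, 20000000
     s = s + x*x - s/x        ! dependent multiply/divide chain
     x = x + 1.0e-10_wp
  end do
  print *, 'result = ', s     ! keep the loop from being optimized away
end program kernel
EOF

for kind in 8 16; do
    sed "s/WPKIND/${kind}/" kernel.f90 > kernel_${kind}.f90
    ifort -O2 -o kernel_${kind} kernel_${kind}.f90
    echo "== REAL*${kind} =="
    time ./kernel_${kind}
done

The ratio of the two timings gives a feel for how much of the reported
32x is raw arithmetic emulation, as opposed to the extra memory traffic
or library calls on top of it.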
Jonathan -- Jonathan Dursi From niftyompi at niftyegg.com Fri Jun 25 08:54:24 2010 From: niftyompi at niftyegg.com (Nifty Tom Mitchell) Date: Fri, 25 Jun 2010 08:54:24 -0700 Subject: [Beowulf] Peformance penalty when using 128-bit reals on AMD64 In-Reply-To: <4C24BCFF.1040007@ias.edu> References: <4C24BCFF.1040007@ias.edu> Message-ID: <20100625155424.GB2263@tosh2egg.ca.sanfran.comcast.net> On Fri, Jun 25, 2010 at 10:28:15AM -0400, Prentice Bisbal wrote: > > Beowulfers, > > One of my Fortran programmers had to increase the precision of his > program so he switched from REAL*8 to REAL*16 which changes the size of > his variables from 64 bits to 128 bits. The program now takes 32x longer > to run. > I am surprised that it works as support in things like the math lib, log and trig functions could be missing. Which compiler is he using? -- T o m M i t c h e l l Found me a new hat, now what? From ntmoore at gmail.com Fri Jun 25 10:30:39 2010 From: ntmoore at gmail.com (Nathan Moore) Date: Fri, 25 Jun 2010 12:30:39 -0500 Subject: [Beowulf] Peformance penalty when using 128-bit reals on AMD64 In-Reply-To: <20100625155424.GB2263@tosh2egg.ca.sanfran.comcast.net> References: <4C24BCFF.1040007@ias.edu> <20100625155424.GB2263@tosh2egg.ca.sanfran.comcast.net> Message-ID: I agree, it seems odd that the OS/compiler has a 128 bit math library available. I certainly hope that your seemingly correct answer is not being corrupted when you compute a 64-bit sine of a 128 bit number... I used GMP's arbitrary precision library (rational number arithmetic) for my thesis a few years back. It was very easy to implement, but not fast (better on x86 hardware than sun/sgi/power as I recall). I too am curious about the sort of algorithm that would require that much precision. (for me, it was inverting a probability distribution that was piecewise-defined, described here, http://arxiv.org/abs/cond-mat/0506786, sorry, nobody ever gets to talk about their thesis...). Nathan On Fri, Jun 25, 2010 at 10:54 AM, Nifty Tom Mitchell wrote: > On Fri, Jun 25, 2010 at 10:28:15AM -0400, Prentice Bisbal wrote: >> >> Beowulfers, >> >> One of my Fortran programmers had to increase the precision of his >> program so he switched from REAL*8 to REAL*16 which changes the size of >> his variables from 64 bits to 128 bits. The program now takes 32x longer >> to run. >> > > I am surprised that it works as support in things like > the math lib, log and trig functions could be missing. > Which compiler is he using? > > > > > > -- > ? ? ? ?T o m ?M i t c h e l l > ? ? ? ?Found me a new hat, now what? 
> > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- - - - - - - - - - - - - - - - - - - - - - Nathan Moore Assistant Professor, Physics Winona State University AIM: nmoorewsu - - - - - - - - - - - - - - - - - - - - - From prentice at ias.edu Fri Jun 25 10:35:52 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Fri, 25 Jun 2010 13:35:52 -0400 Subject: [Beowulf] Peformance penalty when using 128-bit reals on AMD64 In-Reply-To: <20100625155424.GB2263@tosh2egg.ca.sanfran.comcast.net> References: <4C24BCFF.1040007@ias.edu> <20100625155424.GB2263@tosh2egg.ca.sanfran.comcast.net> Message-ID: <4C24E8F8.5050203@ias.edu> Nifty Tom Mitchell wrote: > On Fri, Jun 25, 2010 at 10:28:15AM -0400, Prentice Bisbal wrote: >> Beowulfers, >> >> One of my Fortran programmers had to increase the precision of his >> program so he switched from REAL*8 to REAL*16 which changes the size of >> his variables from 64 bits to 128 bits. The program now takes 32x longer >> to run. >> > > I am surprised that it works as support in things like > the math lib, log and trig functions could be missing. > Which compiler is he using? > I'm 99% sure he's using the Intel Fortran Compiler, ifort. -- Prentice From prentice at ias.edu Fri Jun 25 10:57:41 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Fri, 25 Jun 2010 13:57:41 -0400 Subject: [Beowulf] Peformance penalty when using 128-bit reals on AMD64 In-Reply-To: References: <4C24BCFF.1040007@ias.edu> <20100625155424.GB2263@tosh2egg.ca.sanfran.comcast.net> Message-ID: <4C24EE15.5030201@ias.edu> This may be naive, but I assumed that if the language supports real*16, then the language and compiler would have to support all of the functions that are native to the language (add, subtract, and whatever else the Fortran standard specifies), would have to handle real*16 operands as well. Is that a correct assumption? I can see how more complicated functions and libraries built off the simpler functions would be a problem. I forwarded some of the e-mails from this discussion to the programmer, so he'd understand the possible issues. I don't know exactly what problem this person is tackling with this program, but I can say he is a theoretical physicist whose research includes quantum mechanics and field theory. Prentice Nathan Moore wrote: > I agree, it seems odd that the OS/compiler has a 128 bit math library > available. I certainly hope that your seemingly correct answer is not > being corrupted when you compute a 64-bit sine of a 128 bit number... > > I used GMP's arbitrary precision library (rational number arithmetic) > for my thesis a few years back. It was very easy to implement, but > not fast (better on x86 hardware than sun/sgi/power as I recall). I > too am curious about the sort of algorithm that would require that > much precision. (for me, it was inverting a probability distribution > that was piecewise-defined, described here, > http://arxiv.org/abs/cond-mat/0506786, sorry, nobody ever gets to talk > about their thesis...). 
> > Nathan > > > > On Fri, Jun 25, 2010 at 10:54 AM, Nifty Tom Mitchell > wrote: >> On Fri, Jun 25, 2010 at 10:28:15AM -0400, Prentice Bisbal wrote: >>> Beowulfers, >>> >>> One of my Fortran programmers had to increase the precision of his >>> program so he switched from REAL*8 to REAL*16 which changes the size of >>> his variables from 64 bits to 128 bits. The program now takes 32x longer >>> to run. >>> >> I am surprised that it works as support in things like >> the math lib, log and trig functions could be missing. >> Which compiler is he using? >> >> >> >> >> >> -- >> T o m M i t c h e l l >> Found me a new hat, now what? >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf >> > > > -- Prentice Bisbal Linux Software Support Specialist/System Administrator School of Natural Sciences Institute for Advanced Study Princeton, NJ From mathog at caltech.edu Fri Jun 25 11:26:20 2010 From: mathog at caltech.edu (David Mathog) Date: Fri, 25 Jun 2010 11:26:20 -0700 Subject: [Beowulf] Re: Peformance penalty when using 128-bit reals on AMD64 Message-ID: Prentice Bisbal wrote: > One of my Fortran programmers had to increase the precision of his > program so he switched from REAL*8 to REAL*16 which changes the size of > his variables from 64 bits to 128 bits. The program now takes 32x longer > to run. Doubling the size of the variables can increase the size of the arrays that hold them so that what once fit comfortably into the fastest parts of memory no longer does. Depending on memory access patterns that can easily result in a 16X drop in speed. (2X would normally be lost either way because there is twice as much data to move.) You didn't say why he had to change the precision. If it was a numerical stability issue, well, if the algorithm doesn't work for R*8 going to R*16 may not be a reliable way to calm things down. If this is a case where the exponents are out of range perhaps the whole problem can be scaled up or down by some constant factor so that the numbers once again fit into the range of exponents supported by R*8? Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From glykos at mbg.duth.gr Fri Jun 25 15:04:38 2010 From: glykos at mbg.duth.gr (Nicholas M Glykos) Date: Sat, 26 Jun 2010 01:04:38 +0300 (EEST) Subject: [Beowulf] Peformance penalty when using 128-bit reals on AMD64 In-Reply-To: References: <4C24BCFF.1040007@ias.edu> <20100625155424.GB2263@tosh2egg.ca.sanfran.comcast.net> Message-ID: > http://arxiv.org/abs/cond-mat/0506786, sorry, nobody ever gets to talk > about their thesis...). :-)) (sorry, sorry, I couldn't resist the temptation). -- Dr Nicholas M. Glykos, Department of Molecular Biology and Genetics, Democritus University of Thrace, University Campus, Dragana, 68100 Alexandroupolis, Greece, Tel/Fax (office) +302551030620, Ext.77620, Tel (lab) +302551030615, http://utopia.duth.gr/~glykos/ From derekr42 at gmail.com Fri Jun 25 09:01:34 2010 From: derekr42 at gmail.com (Derek R.) 
Date: Fri, 25 Jun 2010 11:01:34 -0500 Subject: [Beowulf] Peformance penalty when using 128-bit reals on AMD64 In-Reply-To: <4C24BCFF.1040007@ias.edu> References: <4C24BCFF.1040007@ias.edu> Message-ID: Prentice, As was said before, I don't believe that x64 processor architectures support 128 bit precision instructions either (I did glance through the official AMD manuals, and I've read the first 3 in the set for another project, and I can't recall anything about operating on variables that large; storing values of that precision, yes, but not multiplying and storing the results in registers). The results would overflow the registers and then you'd have to fall back on cache (which could be entirely doable, but you'd have to code in assembler to ensure that (a) the results don't fall out of cache and (b) that you are fetching the proper cache lines to obtain your results) or main memory (which would once again involve coding in assembly language). One way I think you might be able to do this is via some of the SIMD multimedia instructions built into the processor. I only gave that volume of the x64 (x86_64, AMD64, tomato-vs-tomato) manuals a cursory glance as that's never been my concern, but I do believe that the processor architecture does indeed support that level of precision and has the instructions to store the rather large results in contiguous registers. Of course, I don't know what this would do to your code. I'd suggest 4 things : 1) Order a set of the AMD64 manuals (they used to be free, not sure now) from AMD 2) Look at a cheap, brute force solution - I'd suggest SSD disks for swap, perhaps (that's the most likely way I can think of the performance degradation you're seeing happening - going out to swap - it's easy and cheap to test on one system, and if it reduces it to a more acceptable wall clock time then see if you can live with that) 3) Find a project that utilizes the CPU's performance counters and measure exactly what is happening - it could be something quite simple that the compiler is doing wrong and you can fix w/ a few flags or a little bit of inline assembly code (I'm no FORTRAN programmer, but whatever standard you're using should support it if the compiler does, and most of them do)...I haven't done this in quite a while, perfctr used to be the standard. What's the current Linux best-practice standard? 4) Start investigating other solutions in terms of CPU/GPU solutions (if it's that important) That's my $0.02 USD that I can add to this discussion on very little sleep, I'll mail you if further inspiration hits with more espresso. I hope it helps. And I can't really comment of the feasibility of GMP libraries as I've never used them. Regards, Derek R. On Fri, Jun 25, 2010 at 9:28 AM, Prentice Bisbal wrote: > Beowulfers, > > One of my Fortran programmers had to increase the precision of his > program so he switched from REAL*8 to REAL*16 which changes the size of > his variables from 64 bits to 128 bits. The program now takes 32x longer > to run. > > I'm not an expert on processor archtitecture, etc., but I do know that > once the size of a variable exceeds the size of the processors > registers, things will slow down considerably. Is his 32x performance > degradation in line with this? > > Is there any way to reduce this degradation? Would The GNU GMP library > (or some other library) help speed things up? 
> > > -- > Prentice > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From amjad11 at gmail.com Sat Jun 26 13:12:15 2010 From: amjad11 at gmail.com (amjad ali) Date: Sat, 26 Jun 2010 16:12:15 -0400 Subject: [Beowulf] MPI Persistent Comm Question Message-ID: Hi all, What is the be the best way of using MPI persistent communication in an iterative/repetative kind of code about calling MPI_Free(); Should we call MPI_Free() in every iteration or only once when all the iterations/repetitions are performed? Means which one is the best out of following two: (1) Call this subroutines 1000 times ============================= call MPI_RECV_Init() call MPI_Send_Init() call MPI_Startall() call MPI_Free() ============================= (2) Call this subroutines 1000 times =========================== call MPI_RECV_Init() call MPI_Send_Init() call MPI_Startall() ========================== call MPI_Free() --------- call it only once at the end. Thanks in advance. best regards AA -------------- next part -------------- An HTML attachment was scrubbed... URL: From niftyompi at niftyegg.com Sat Jun 26 15:40:24 2010 From: niftyompi at niftyegg.com (NiftyOMPI Tom Mitchell) Date: Sat, 26 Jun 2010 15:40:24 -0700 Subject: [Beowulf] Peformance penalty when using 128-bit reals on AMD64 In-Reply-To: References: <4C24BCFF.1040007@ias.edu> <20100625155424.GB2263@tosh2egg.ca.sanfran.comcast.net> Message-ID: On Fri, Jun 25, 2010 at 10:30 AM, Nathan Moore wrote: ...snip... > I used GMP's arbitrary precision library (rational number arithmetic) > for my thesis a few years back. ?It was very easy to implement, but > not fast (better on x86 hardware than sun/sgi/power as I recall). ?I > too am curious about the sort of algorithm that would require that > much precision. A lot can depend on the dynamic range of the values being operated on. If there is a mix of very large and very small values odd results can surface especially in parallel code. Also basic statistics where the squared values can sometimes unexpectedly overflow a computation when the "input" is well within range. It does make a lot of sense to test code and data with 32 and 64 bit floating point to see if odd results surface. It would be nice if systems+compilers had the option of 128 and even 256 bit operations so code could be tested for sensitivity that matters. I sort of wish precision was universally an application ./configure or a #define and while I am dreaming, 128 and 256 bit versions would just run... A +24x slower run would validate a lot of codes and critical runs. In a 30 second scan of GMP's arbitrary precision library I cannot tell if 32 and 64bit sizes fall out as equal in performance to native types. 
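Standard Fortran kinds get you part of the way to that configure-time wish. A sketch (module and routine names invented here, and the 128-bit kind only exists on compilers that provide one) is to route every declaration through a single working-precision constant, so a whole-code precision sweep is one edit and a rebuild:

  module working_precision
    implicit none
    ! Pick one kind as the working precision and rebuild to re-test the code.
    ! selected_real_kind returns -1 if the compiler has no such kind, so the
    ! qp line may have to go on compilers without a 128-bit real.
    integer, parameter :: sp = selected_real_kind(6,   37)   ! ~32-bit
    integer, parameter :: dp = selected_real_kind(15, 307)   ! ~64-bit
    integer, parameter :: qp = selected_real_kind(33, 4931)  ! ~128-bit
    integer, parameter :: wp = dp      ! <-- the only line that ever changes
  end module working_precision

  subroutine axpy(n, a, x, y)
    use working_precision, only: wp
    implicit none
    integer,  intent(in)    :: n
    real(wp), intent(in)    :: a, x(n)
    real(wp), intent(inout) :: y(n)
    y = a*x + y          ! every literal and local elsewhere uses _wp as well
  end subroutine axpy

Flipping wp to qp (or sp) then gives exactly that slower-but-validating run without touching the numerics themselves.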
-- NiftyOMPI T o m M i t c h e l l From lindahl at pbm.com Sat Jun 26 17:39:34 2010 From: lindahl at pbm.com (Greg Lindahl) Date: Sat, 26 Jun 2010 17:39:34 -0700 Subject: [Beowulf] Peformance penalty when using 128-bit reals on AMD64 In-Reply-To: References: <4C24BCFF.1040007@ias.edu> <20100625155424.GB2263@tosh2egg.ca.sanfran.comcast.net> Message-ID: <20100627003934.GJ21079@bx9.net> On Sat, Jun 26, 2010 at 03:40:24PM -0700, NiftyOMPI Tom Mitchell wrote: > In a 30 second scan of GMP's arbitrary precision library I cannot tell > if 32 and 64bit sizes fall out as equal in performance to native types. No. It's great for arbitrary large sizes and not so good for 128 bits, compared to a library that does only 128 bits. It makes no attempt to do 64 or 32 bit stuff using native types. -- greg From Bill.Rankin at sas.com Mon Jun 28 09:05:12 2010 From: Bill.Rankin at sas.com (Bill Rankin) Date: Mon, 28 Jun 2010 16:05:12 +0000 Subject: [Beowulf] MPI Persistent Comm Question In-Reply-To: References: Message-ID: <76097BB0C025054786EFAB631C4A2E3C0932CF8F@MERCMBX04R.na.SAS.com> Uhmmm, what is "MPI_Free()"? It does not appear to be part of the MPI2 standard. MPI_Free_mem() is part of the standard, but is used in conjunction with MPI_Alloc_mem() and doesn't seem to refer to what you are describing here. Is it a local procedure and if so, what does it do? -bill From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of amjad ali Sent: Saturday, June 26, 2010 4:12 PM To: Beowulf Mailing List Subject: [Beowulf] MPI Persistent Comm Question Hi all, What is the be the best way of using MPI persistent communication in an iterative/repetative kind of code about calling MPI_Free(); Should we call MPI_Free() in every iteration or only once when all the iterations/repetitions are performed? Means which one is the best out of following two: (1) Call this subroutines 1000 times ============================= call MPI_RECV_Init() call MPI_Send_Init() call MPI_Startall() call MPI_Free() ============================= (2) Call this subroutines 1000 times =========================== call MPI_RECV_Init() call MPI_Send_Init() call MPI_Startall() ========================== call MPI_Free() --------- call it only once at the end. Thanks in advance. best regards AA -------------- next part -------------- An HTML attachment was scrubbed... URL: From hahn at mcmaster.ca Mon Jun 28 12:40:26 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon, 28 Jun 2010 15:40:26 -0400 (EDT) Subject: [Beowulf] MPI Persistent Comm Question In-Reply-To: <76097BB0C025054786EFAB631C4A2E3C0932CF8F@MERCMBX04R.na.SAS.com> References: <76097BB0C025054786EFAB631C4A2E3C0932CF8F@MERCMBX04R.na.SAS.com> Message-ID: > Uhmmm, what is "MPI_Free()"? he probably meant MPI_Request_free > (1) > Call this subroutines 1000 times > ============================= > call MPI_RECV_Init() > call MPI_Send_Init() > call MPI_Startall() > call MPI_Free() > ============================= > > (2) > Call this subroutines 1000 times > =========================== > call MPI_RECV_Init() > call MPI_Send_Init() > call MPI_Startall() > ========================== > call MPI_Free() --------- call it only once at the end. I've never even seen these MPI_Start-related interfaces in use, but MPI_Request_free appears to be callable once per MPI_*_Init. that would argue for pairing as in the first sequence. afaikt, the point of the interface is actually init/start+/free - that is, you set up a persistent send and kick it into action many times. 
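In sketch form (the ring-neighbour exchange, buffer size and tag are invented for illustration; the calls themselves are the standard persistent-request interface), that usage looks like:

  program persistent_ring
    use mpi
    implicit none
    integer, parameter :: n = 1000
    double precision   :: sendbuf(n), recvbuf(n)
    integer :: ierr, rank, nprocs, left, right, iter
    integer :: reqs(2), stats(MPI_STATUS_SIZE, 2)

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
    right = mod(rank + 1, nprocs)
    left  = mod(rank - 1 + nprocs, nprocs)

    ! create the persistent requests ONCE, before the iteration loop
    call MPI_Send_init(sendbuf, n, MPI_DOUBLE_PRECISION, right, 0, &
                       MPI_COMM_WORLD, reqs(1), ierr)
    call MPI_Recv_init(recvbuf, n, MPI_DOUBLE_PRECISION, left,  0, &
                       MPI_COMM_WORLD, reqs(2), ierr)

    do iter = 1, 1000
       sendbuf = dble(iter)                    ! refresh the outgoing data
       call MPI_Startall(2, reqs, ierr)        ! (re)activate both requests
       call MPI_Waitall(2, reqs, stats, ierr)  ! complete them; they stay allocated
    end do

    ! release the persistent requests ONCE, after the last iteration
    call MPI_Request_free(reqs(1), ierr)
    call MPI_Request_free(reqs(2), ierr)
    call MPI_Finalize(ierr)
  end program persistent_ring

In other words, you kick the same requests off on every pass through the loop,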
but only free it once. From rpnabar at gmail.com Tue Jun 29 13:26:20 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 29 Jun 2010 15:26:20 -0500 Subject: [Beowulf] Survey how industrial companies use their HPC Resources In-Reply-To: <814583.50541.qm@web30602.mail.mud.yahoo.com> References: <20100615201226.GF21791@bx9.net> <814583.50541.qm@web30602.mail.mud.yahoo.com> Message-ID: On Wed, Jun 16, 2010 at 7:08 PM, Buccaneer for Hire. wrote: > --- On Tue, 6/15/10, Greg Lindahl wrote: > >> From: Greg Lindahl >> >> I noticed that the country list doesn't include the USA. >> > I just made the assumption if you left it blank that it was understood. :) I noticed that under interconnect options there was no 10GigE listed. Are any others on the list using this or am I the only one. -- Rahul From rpnabar at gmail.com Tue Jun 29 13:37:47 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 29 Jun 2010 15:37:47 -0500 Subject: [Beowulf] Re: Bugfix for Broadcom NICs losing connectivity In-Reply-To: <4C08BBCA.5070508@diamond.ac.uk> References: <201005251900.o4PJ0ElP016422@bluewest.scyld.com> <20100525194056.GB16022@kaizen.mayo.edu> <4C08BBCA.5070508@diamond.ac.uk> Message-ID: On Fri, Jun 4, 2010 at 3:39 AM, Tina Friedrich wrote: > We've had that happen on some of our servers. Currently using the > disable_msi workaround, which seems to have stopped it. I believe there's > supposed to be a fix in the latest Red Hat kernel but we haven't really > tested that yet. I saw the exact same symptoms as Tina. Not a hard-correlation but I mostly saw it during periods of high NFS loads. The disable_msi workaround does work like a charm. Before that my only option was to log in locally via console and then do a ifdown; ifup on the interface. -- Rahul From rpnabar at gmail.com Tue Jun 29 22:30:12 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 30 Jun 2010 00:30:12 -0500 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? Message-ID: The Top500 list has many useful metrics but I didn't see any $$ based metrics there. Are there any lists that document the $-per-teraflop (apologies to international members!) of any of the systems in the Supercomputer / Beowulf world? Googling "dollars per teraflop" didn't give me anything useful. I'm speculating, one reason could be that sites are loath to disclose their exact $ purchase prices etc. But on the other hand for most of the publicly owned systems this should be accessible information anyways. I was just thinking that this might be an interesting parameter to track. I was also curious as to when systems become larger is there an economy of scale in the Beowulf world? i.e. for something like Jaguar or Kraken is the $/teraflop much lower than what it is for my tiny 100-node system. Another question could be: Is it cheaper to assemble 100 Teraflops of capacity in the US or WU or China etc. Of course, HPC is not really commoditized so a Teraflop based $ value may not be strictly an apples-to-apples comparison but still..... Just wondering what statistics are available out there...... -- Rahul From lindahl at pbm.com Tue Jun 29 22:50:26 2010 From: lindahl at pbm.com (Greg Lindahl) Date: Tue, 29 Jun 2010 22:50:26 -0700 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: References: Message-ID: <20100630055026.GF28068@bx9.net> On Wed, Jun 30, 2010 at 12:30:12AM -0500, Rahul Nabar wrote: > The Top500 list has many useful metrics but I didn't see any $$ based > metrics there. 
Other communities with $$-based metrics haven't had much success with them. In HPC, many contracts are multi-year, multi-delivery, or, they include significant extra stuff beyond the iron. -- greg From prentice at ias.edu Wed Jun 30 05:45:17 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Wed, 30 Jun 2010 08:45:17 -0400 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: References: Message-ID: <4C2B3C5D.9030405@ias.edu> Even in publicly-owned computers, I'm sure the exact price of the cluster is well hidden. As Greg said, the cost might be buried in multi-year contract or as part of a larger contract. These large supercomputers at national labs or large universities are often provided by the vendor for little or no profit (maybe even a loss) in exchange for prestige/advertising opportunities or R&D opportunities. They make up this loss by selling to money-making corporations for a much bigger margin. For example, I would not be surprised if IBM practically gave away RoadRunner to Los Alamos in exchange for the computing expertise at Los Alamos to help develop such an architecture and then be able to say that IBM builds the world's fastest computers (and that your company can have one just like it, for a price). Oh, and the users at Los Alamos probably provide lots of feedback to IBM which helps them build better systems in the future. (Don't shoot me if I'm wrong. I'm just theorizing here) I used to work at the Princeton Plasma Physics Lab (www.pppl.gov), a Dept of Energy National Lab, and I can tell you many of the the systems sold to PPPL when I worked there was sold under a NDA, preventing anyone from discussing the price. Yes, the budgets are public information, but that tells you the computer hardware budget for a year, not how much was spent on each computer. Prentice Rahul Nabar wrote: > The Top500 list has many useful metrics but I didn't see any $$ based > metrics there. Are there any lists that document the $-per-teraflop > (apologies to international members!) of any of the systems in the > Supercomputer / Beowulf world? Googling "dollars per teraflop" didn't > give me anything useful. > > I'm speculating, one reason could be that sites are loath to disclose > their exact $ purchase prices etc. But on the other hand for most of > the publicly owned systems this should be accessible information > anyways. I was just thinking that this might be an interesting > parameter to track. I was also curious as to when systems become > larger is there an economy of scale in the Beowulf world? i.e. for > something like Jaguar or Kraken is the $/teraflop much lower than what > it is for my tiny 100-node system. Another question could be: Is it > cheaper to assemble 100 Teraflops of capacity in the US or WU or China > etc. > > Of course, HPC is not really commoditized so a Teraflop based $ value > may not be strictly an apples-to-apples comparison but still..... > > Just wondering what statistics are available out there...... > -- Prentice From joshua_mora at usa.net Wed Jun 30 06:10:47 2010 From: joshua_mora at usa.net (Joshua mora acosta) Date: Wed, 30 Jun 2010 08:10:47 -0500 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? Message-ID: <386oFdNJv5776S01.1277903447@web01.cms.usa.net> I think the money part will be difficult to get (it is like a politically incorrect question). 
Nevertheless, you can split the money in two parts: purchase (which I am sure you will never get) and electric bill for kipping the system up and running while you run HPL and when you run stream. Then you could try at least to put the cost of the electric bill. electric_bill_(USD)/performance_(TFLOPs) The electric bill though will change for a given amount of kWs depending on the contract/location you establish with the electric company. So it is difficult to get that info as well. So it is perhaps better to compare systems in terms of TFLOP/kW. And factor in there what you are capable of negotiating on electric cost and purchase and support. Going back to the easy part: For instance on GPU_CPU cluster based systems, you can achieve 1.1(real_DP_TFLOP)/kW with a ratio GPU/CPU=2 With that I have factored a whole rack with HW and SW stacks for 10TF real double precission under 400K USD cost. Joshua ------ Original Message ------ Received: 12:45 AM CDT, 06/30/2010 From: Rahul Nabar To: Beowulf Mailing List Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? > The Top500 list has many useful metrics but I didn't see any $$ based > metrics there. Are there any lists that document the $-per-teraflop > (apologies to international members!) of any of the systems in the > Supercomputer / Beowulf world? Googling "dollars per teraflop" didn't > give me anything useful. > > I'm speculating, one reason could be that sites are loath to disclose > their exact $ purchase prices etc. But on the other hand for most of > the publicly owned systems this should be accessible information > anyways. I was just thinking that this might be an interesting > parameter to track. I was also curious as to when systems become > larger is there an economy of scale in the Beowulf world? i.e. for > something like Jaguar or Kraken is the $/teraflop much lower than what > it is for my tiny 100-node system. Another question could be: Is it > cheaper to assemble 100 Teraflops of capacity in the US or WU or China > etc. > > Of course, HPC is not really commoditized so a Teraflop based $ value > may not be strictly an apples-to-apples comparison but still..... > > Just wondering what statistics are available out there...... > > -- > Rahul > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From james.p.lux at jpl.nasa.gov Wed Jun 30 06:25:06 2010 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Wed, 30 Jun 2010 06:25:06 -0700 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: Message-ID: On 6/29/10 10:30 PM, "Rahul Nabar" wrote: > The Top500 list has many useful metrics but I didn't see any $$ based > metrics there. Are there any lists that document the $-per-teraflop > (apologies to international members!) of any of the systems in the > Supercomputer / Beowulf world? Googling "dollars per teraflop" didn't > give me anything useful. > > I'm speculating, one reason could be that sites are loath to disclose > their exact $ purchase prices etc. But on the other hand for most of > the publicly owned systems this should be accessible information > anyways. > It's harder than you think to come up with a "cost" Is it just the hardware purchase cost? Or do you count the assembly cost? What about infrastructure mods to hold all those racks? 
And assuming you *do* get a number, how do you compare it fairly. For instance, if you had 100 discount PCs put on the gym floor by volunteer labor vs buying an already integrated rack? Do you could integration support? Applications porting? As far as publically funded ones go.. You might wind up with a big accounting challenge to go through hundreds of invoices and contracts. In California, for instance, the CA Public Records Act says you can go and ask for pretty much any record that doesn't have personally identifiable information. But that's a long way from getting a nice "here's how much the supercomputer cost". You might have budgets and invoices from dozens of firms to go through and find stuff. > > > > I was just thinking that this might be an interesting > parameter to track. I was also curious as to when systems become > larger is there an economy of scale in the Beowulf world? i.e. for > something like Jaguar or Kraken is the $/teraflop much lower than what > it is for my tiny 100-node system. Another question could be: Is it > cheaper to assemble 100 Teraflops of capacity in the US or WU or China > etc. I think it *is* interesting, and would be useful, because managers are always having to make decisions about make vs buy, or when to buy (do we buy now and get started, or wait 1 year, when the machines are faster for the same price) From john.hearns at mclaren.com Wed Jun 30 06:38:04 2010 From: john.hearns at mclaren.com (Hearns, John) Date: Wed, 30 Jun 2010 14:38:04 +0100 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: References: Message-ID: <68A57CCFD4005646957BD2D18E60667B10F1762E@milexchmb1.mil.tagmclarengroup.com> > I'm speculating, one reason could be that sites are loath to disclose > their exact $ purchase prices etc. But on the other hand for most of > the publicly owned systems this should be accessible information > anyways. I was just thinking that this might be an interesting > parameter to track. I was also curious as to when systems become > larger is there an economy of scale in the Beowulf world? i.e. for > something like Jaguar or Kraken is the $/teraflop much lower than what > it is for my tiny 100-node system. Another question could be: Is it > cheaper to assemble 100 Teraflops of capacity in the US or WU or China I don't think you'll find that information anywhere readily. Also consider the difference between peak HPL flops rating and the useful work you will get out of a system. Sticking my neck out slightly here, systems with lots of GPUs will score highly on the $$/flop ratings - but do you get that amount of work under real-world loads? John Hearns The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From landman at scalableinformatics.com Wed Jun 30 07:13:34 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 30 Jun 2010 10:13:34 -0400 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? 
In-Reply-To: <4C2B3C5D.9030405@ias.edu> References: <4C2B3C5D.9030405@ias.edu> Message-ID: <4C2B510E.5050406@scalableinformatics.com> Prentice Bisbal wrote: > These large supercomputers at national labs or large universities are > often provided by the vendor for little or no profit (maybe even a loss) s/maybe/usually/ > in exchange for prestige/advertising opportunities or R&D opportunities. Hah. Allow me to restate this. Hah. It is extraordinarily rare that an entity will give you permission to use their name in any advertising. The most you can hope for is, generally, a press release. Prestige does not translate into (profitable) revenue. > They make up this loss by selling to money-making corporations for a > much bigger margin. Hmmm .... I think there may be a fundamental disconnect between the assumptions of folks in academia and the reality of this particular market. I am not bashing on Prentis. I would like to point out that the "much larger margins" in a cutthroat business such as clusters are ... er ... not much larger. I could bore people with anecdotes, but the fundamental take home message is, if you believe this (much larger margins bit), you are mistaken. The market for commercial HPC is under intense downward price pressure. More so than ever before. Companies, when they have money to spend, want to spend less of it, and get the same or better systems/performance than in the past. What we are seeing is a fundamental change of business model, over to one that keeps upfront capital costs as low as possible, and pushes things to expense columns. This is in part what is driving the significant interest in accelerators /APUs, and in remote cycle rental. The costs associated with powering and cooling accelerators are minimal as compared to small/mid sized clusters. The costs associated with very large clusters can be made into pure on-demand expenses with remote cycle rental. > For example, I would not be surprised if IBM practically gave away > RoadRunner to Los Alamos in exchange for the computing expertise at Los > Alamos to help develop such an architecture and then be able to say that > IBM builds the world's fastest computers (and that your company can have > one just like it, for a price). Oh, and the users at Los Alamos probably > provide lots of feedback to IBM which helps them build better systems in > the future. (Don't shoot me if I'm wrong. I'm just theorizing here) Don't sell the folks at TJ Watson/IBM Research short. They are a very bright group. The co-R&D elements are a way IBM can dominate the HPC bits at the high end, and provide something that looks like an in-kind contribution type of model so that LANL and others can go to their granting agencies and get either more money, or fulfill specific contract points. IBM is a business, and in most cases, won't generally have a particular business unit make a loss for "prestige" points. That doesn't make the board/shareholders happy. > I used to work at the Princeton Plasma Physics Lab (www.pppl.gov), a > Dept of Energy National Lab, and I can tell you many of the the systems > sold to PPPL when I worked there was sold under a NDA, preventing anyone > from discussing the price. Yes, the budgets are public information, but > that tells you the computer hardware budget for a year, not how much was > spent on each computer. Well part of that is to prevent shopping the quote. We see this *all* the time. Someone asks for a quote, you provide it. 
They then go and take your quote, elide specific company information, and then send it around asking others to beat it. An NDA gives you a mechanism to stop this with a specific legal enjoinder from discussing terms. Copyrighting quotes and specifically restricting redistribution of content and information contained is another method. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From Bill.Rankin at sas.com Wed Jun 30 08:11:54 2010 From: Bill.Rankin at sas.com (Bill Rankin) Date: Wed, 30 Jun 2010 15:11:54 +0000 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <386oFdNJv5776S01.1277903447@web01.cms.usa.net> References: <386oFdNJv5776S01.1277903447@web01.cms.usa.net> Message-ID: <76097BB0C025054786EFAB631C4A2E3C09331921@MERCMBX04R.na.SAS.com> > I think the money part will be difficult to get (it is like a > politically > incorrect question). Joe addressed this pretty well. For the large systems, it's almost always under NDA. > Nevertheless, you can split the money in two parts: purchase (which I > am sure > you will never get) and electric bill for kipping the system up and > running while you run HPL and when you run stream. Once you start looking at the power bills (for both the system as well as all the associated infrastructure, like cooling) then you pretty much need to start looking at approximations for the total cost of ownership (TCO). Depending on the organization, many of these costs are well hidden. We went through this exercise when I was at Duke (with due credit to Rob Brown who did a lot of the heavy lifting). Some of the things you have to consider are: - Power (for both machines and cooling). Given commercial rates at the time (~2004) this worked out to about $1/Watt/year. That makes for $300k/year for a 1000 node cluster at 300W/node. - Don't forget depreciation on all that support equipment. While your cluster may have a useful lifetime of around 3-5 years, all those air handlers, power conditioners and UPS's have lifetimes too. Figure 10-15 years (if you can reuse them) and factor that amortized cost into your bottom line. - Staff salaries, both for administration and operations/monitoring. Loaded salary for a decent cluster admin may be $100k/year or more. Bottom line is that you could spend 30%-50% (or more) additional dollars beyond the cost of the hardware just to cover the basic support needs for the facility. -b From prentice at ias.edu Wed Jun 30 08:37:30 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Wed, 30 Jun 2010 11:37:30 -0400 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C2B510E.5050406@scalableinformatics.com> References: <4C2B3C5D.9030405@ias.edu> <4C2B510E.5050406@scalableinformatics.com> Message-ID: <4C2B64BA.3000704@ias.edu> Joe Landman wrote: > Prentice Bisbal wrote: > >> These large supercomputers at national labs or large universities are >> often provided by the vendor for little or no profit (maybe even a loss) > > s/maybe/usually/ > >> in exchange for prestige/advertising opportunities or R&D opportunities. > > Hah. Allow me to restate this. > > Hah. > > It is extraordinarily rare that an entity will give you permission to > use their name in any advertising. The most you can hope for is, > generally, a press release. 
And what would you call all the press that IBM got for Roadrunner and Cray got for Jaguar when those systems were at the top of the Top500? Also, at SC09, the ORNL booth was showing off that they now had the top system, and weren't hiding the fact that it was built by Cray. I would call that a lot more than just a "press release". All that industry media coverage is a lot better advertising than any single paid advertisement. > > Prestige does not translate into (profitable) revenue. > I think you meant to say "Prestige does not always translate into profitable revenue. Rolls Royce, Bentley, and Lamborghini are just a few examples of prestige not translating into profits. Lexus, Acura, and Infiniti are all examples of prestige translating into huge profits. Lexus, Acura, and Infiniti autos aren't radically different from the Toyotas, Hondas, and Nissans they are based on, but the cost more, mostly because of the prestige of the upmarket name. >> They make up this loss by selling to money-making corporations for a >> much bigger margin. > > Hmmm .... > > I think there may be a fundamental disconnect between the assumptions of > folks in academia and the reality of this particular market. I am not > bashing on Prentis. I would like to point out that the "much larger > margins" in a cutthroat business such as clusters are ... er ... not > much larger. > > I could bore people with anecdotes, but the fundamental take home > message is, if you believe this (much larger margins bit), you are > mistaken. Any profit is "much larger" than a loss. > >> For example, I would not be surprised if IBM practically gave away >> RoadRunner to Los Alamos in exchange for the computing expertise at Los >> Alamos to help develop such an architecture and then be able to say that >> IBM builds the world's fastest computers (and that your company can have >> one just like it, for a price). Oh, and the users at Los Alamos probably >> provide lots of feedback to IBM which helps them build better systems in >> the future. (Don't shoot me if I'm wrong. I'm just theorizing here) > > Don't sell the folks at TJ Watson/IBM Research short. They are a very > bright group. The co-R&D elements are a way IBM can dominate the HPC > bits at the high end, and provide something that looks like an in-kind > contribution type of model so that LANL and others can go to their > granting agencies and get either more money, or fulfill specific > contract points. I wasn't selling the brains IBM short. We all know that two heads are better then one when tackling a problem. I was saying that if IBM geniuses = good, and LANL geniuses = good, then IBM geniuses + LANL geniuses = better. > > IBM is a business, and in most cases, won't generally have a particular > business unit make a loss for "prestige" points. That doesn't make the > board/shareholders happy. > How much profit did IBM make off of Deep Blue when it beat Gary Kasparov? None that I know of. However, it did provide IBM R&D opportunities and when it finally beat Gary Kasparov, plenty of free advertising for IBM through news coverage, and ...prestige. -- Prentice From landman at scalableinformatics.com Wed Jun 30 09:10:17 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 30 Jun 2010 12:10:17 -0400 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? 
In-Reply-To: <4C2B64BA.3000704@ias.edu> References: <4C2B3C5D.9030405@ias.edu> <4C2B510E.5050406@scalableinformatics.com> <4C2B64BA.3000704@ias.edu> Message-ID: <4C2B6C69.60309@scalableinformatics.com> Prentice Bisbal wrote: >> It is extraordinarily rare that an entity will give you permission to >> use their name in any advertising. The most you can hope for is, >> generally, a press release. > > And what would you call all the press that IBM got for Roadrunner and > Cray got for Jaguar when those systems were at the top of the Top500? "extraordinarily rare" > Also, at SC09, the ORNL booth was showing off that they now had the top > system, and weren't hiding the fact that it was built by Cray. I would > call that a lot more than just a "press release". All that industry Again, academic/national labs gain prestige points by doing this. Prestige rarely, if ever, turns into revenue, and even less frequently, into profit. > media coverage is a lot better advertising than any single paid > advertisement. Hmm .... see above. Did this media coverage inspire you to purchase a Blue Gene? Or an XT6? > >> Prestige does not translate into (profitable) revenue. >> > > I think you meant to say "Prestige does not always translate into > profitable revenue. Thank you ... I agree, I implied the "extraordinarily rare" aspect of this. Basically prestige plus $4 + change will get you a Grande sized mocha triple shot from Starbucks. Not much more than that. This isn't a jaundiced view (ok, I hope not!) of the situation ... for-profit companies can't afford to be paid in "prestige". Put it another way. Supposed the IAS's granting agencies decided that they would provide grants that didn't cover complete costs, but provided "prestige" for getting the grant in the first place. Even though IAS isn't a for-profit institution per se, it still has bills to pay, and people to employ. This sort of scenario is the case in for-profit HPC in academia. The prestige earned doesn't pay the bills. You can't take the prestige to the bank and earn interest on it. Its an intangible ... good will ... is how I think it is accounted for. But is it valuable? Ask IBM how many more BG's the've sold as a result of that "marketing campaign". As I said, I think there is a fundamental disconnect between what people would like to believe, and the (rather harsh) realities of the market. This is not a bash. This is an observation. > Rolls Royce, Bentley, and Lamborghini are just a few examples of > prestige not translating into profits. Bentley had been up for sale, RR is now owned by Tata, as is Jaguar. Prestige does not pay the bills. And if I am wrong on this, please, educate me ... I'd like to figure out how to do this. > > Lexus, Acura, and Infiniti are all examples of prestige translating into > huge profits. Lexus, Acura, and Infiniti autos aren't radically Er ... no. They are all examples of really good marketing, and not trying to compete on price. > different from the Toyotas, Hondas, and Nissans they are based on, but > the cost more, mostly because of the prestige of the upmarket name. Thats marketing, not prestige. When you introduce a new brand, it doesn't have a history, and hence, no prestige. Marketing is what enables you to attract customers. Some customers will be put off by price. You either lower your pricing to keep them, or you ignore them as a market. The folks you indicated, all ignore the price sensitive elements of the market. Which makes them vulnerable to attack by the Hyundai's and others of the world. 
> > >>> They make up this loss by selling to money-making corporations for a >>> much bigger margin. >> Hmmm .... >> >> I think there may be a fundamental disconnect between the assumptions of >> folks in academia and the reality of this particular market. I am not >> bashing on Prentis. I would like to point out that the "much larger >> margins" in a cutthroat business such as clusters are ... er ... not >> much larger. >> >> I could bore people with anecdotes, but the fundamental take home >> message is, if you believe this (much larger margins bit), you are >> mistaken. > > Any profit is "much larger" than a loss. So is a $1 USD profit on a $1M USD sale a reasonable profit? And if the loss is $100 USD on the $1M USD sale, is the $1 >> $100 ? No. > >>> For example, I would not be surprised if IBM practically gave away >>> RoadRunner to Los Alamos in exchange for the computing expertise at Los >>> Alamos to help develop such an architecture and then be able to say that >>> IBM builds the world's fastest computers (and that your company can have >>> one just like it, for a price). Oh, and the users at Los Alamos probably >>> provide lots of feedback to IBM which helps them build better systems in >>> the future. (Don't shoot me if I'm wrong. I'm just theorizing here) >> Don't sell the folks at TJ Watson/IBM Research short. They are a very >> bright group. The co-R&D elements are a way IBM can dominate the HPC >> bits at the high end, and provide something that looks like an in-kind >> contribution type of model so that LANL and others can go to their >> granting agencies and get either more money, or fulfill specific >> contract points. > > I wasn't selling the brains IBM short. We all know that two heads are > better then one when tackling a problem. I was saying that if IBM > geniuses = good, and LANL geniuses = good, then IBM geniuses + LANL > geniuses = better. Ahh .... ok. IBM has some really good folks at their research locations. > >> IBM is a business, and in most cases, won't generally have a particular >> business unit make a loss for "prestige" points. That doesn't make the >> board/shareholders happy. >> > > How much profit did IBM make off of Deep Blue when it beat Gary > Kasparov? None that I know of. However, it did provide IBM R&D > opportunities and when it finally beat Gary Kasparov, plenty of free > advertising for IBM through news coverage, and ...prestige. ... which, as you note, translated to no profit. The press however, provided them effectively free marketing. Publish this story, and we don't have to pay for it. ... which is worth ... what? How many BG's did IBM sell as a result of the chess match? How many people made a decision, influenced in part, by that PR and free marketing? Thats the point. Like it or not, prestige doesn't placate shareholders, board members, or wall street. They want to see profits, pure and simple. > > -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From rpnabar at gmail.com Wed Jun 30 09:17:28 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 30 Jun 2010 11:17:28 -0500 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? 
In-Reply-To: <68A57CCFD4005646957BD2D18E60667B10F1762E@milexchmb1.mil.tagmclarengroup.com> References: <68A57CCFD4005646957BD2D18E60667B10F1762E@milexchmb1.mil.tagmclarengroup.com> Message-ID: On Wed, Jun 30, 2010 at 8:38 AM, Hearns, John wrote: > Sticking my neck out slightly here, systems with lots of GPUs will score > highly on > the $/flop ratings - but do you get that amount of work under > real-world loads? Sticking my neck out even more, but maybe that problem can be solved by using the actual Teraflops as opposed to peak-teraflops? I'm curious, how good are these GPU based systems when solving something like the Linpack or the SPEC benchmarks? Of course, I'm not saying either of these benchmarks are representative of a "real-world" load. But perhaps they are closer than a peak-teraflops metric. -- Rahul From joshua_mora at usa.net Wed Jun 30 09:42:39 2010 From: joshua_mora at usa.net (Joshua mora acosta) Date: Wed, 30 Jun 2010 11:42:39 -0500 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? Message-ID: <429oFdqpn6192S04.1277916159@web04.cms.usa.net> Now it comes the funny part: Out of that electric bill of say 300k for 1000 nodes at a reasonable efficiency of 30% for a real well tuned application (not Linpack which is >80%) it basically means 90K USD is worth the work. The other 210K US is electric bill wasted waiting for data to hit the caches. Joshua ------ Original Message ------ Received: 10:12 AM CDT, 06/30/2010 From: Bill Rankin To: Joshua mora acosta , Rahul Nabar , Beowulf Mailing List Subject: RE: [Beowulf] dollars-per-teraflop : any lists like the Top500? > > I think the money part will be difficult to get (it is like a > > politically > > incorrect question). > > Joe addressed this pretty well. For the large systems, it's almost always under NDA. > > > Nevertheless, you can split the money in two parts: purchase (which I > > am sure > > you will never get) and electric bill for kipping the system up and > > running while you run HPL and when you run stream. > > Once you start looking at the power bills (for both the system as well as all the associated infrastructure, like cooling) then you pretty much need to start looking at approximations for the total cost of ownership (TCO). Depending on the organization, many of these costs are well hidden. We went through this exercise when I was at Duke (with due credit to Rob Brown who did a lot of the heavy lifting). Some of the things you have to consider are: > > - Power (for both machines and cooling). Given commercial rates at the time (~2004) this worked out to about $1/Watt/year. That makes for $300k/year for a 1000 node cluster at 300W/node. > > - Don't forget depreciation on all that support equipment. While your cluster may have a useful lifetime of around 3-5 years, all those air handlers, power conditioners and UPS's have lifetimes too. Figure 10-15 years (if you can reuse them) and factor that amortized cost into your bottom line. > > - Staff salaries, both for administration and operations/monitoring. Loaded salary for a decent cluster admin may be $100k/year or more. > > Bottom line is that you could spend 30%-50% (or more) additional dollars beyond the cost of the hardware just to cover the basic support needs for the facility. > > -b > > From james.p.lux at jpl.nasa.gov Wed Jun 30 09:49:24 2010 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Wed, 30 Jun 2010 09:49:24 -0700 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? 
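Spelling that split out with the figures already used in this thread (Bill's ~$1/W/year at ~300 W/node, Joshua's ~30% sustained efficiency on real applications versus >80% on HPL), purely as a back-of-envelope illustration:

  power + cooling : 1000 nodes x 300 W x $1/W/year = $300k/year
  'useful' share  : 0.30 x $300k                   = $90k/year
  stalled share   : 0.70 x $300k                   = $210k/year

so dividing the same bill by sustained rather than peak TFLOPs makes the operating cost per delivered TFLOP roughly 3x (1/0.3) higher.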
In-Reply-To: <4C2B510E.5050406@scalableinformatics.com> References: <4C2B3C5D.9030405@ias.edu> <4C2B510E.5050406@scalableinformatics.com> Message-ID: Comments interspersed below... Joe's comments are generally right on, and I can provide some insight into how governments buy stuff (it's pretty strictly regulated.. far more so than in private industry, but some of the processes seem arcane and bewildering at first glance) Jim > -----Original Message----- > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Joe Landman > > Prentice Bisbal wrote: > > > These large supercomputers at national labs or large universities are > > often provided by the vendor for little or no profit (maybe even a loss) > > s/maybe/usually/ > > > in exchange for prestige/advertising opportunities or R&D opportunities. > > Hah. Allow me to restate this. > > Hah. > > It is extraordinarily rare that an entity will give you permission to > use their name in any advertising. The most you can hope for is, > generally, a press release. I agree. Here at JPL we have pretty stringent rules under which we can do things in terms of using the name of JPL or NASA. Granted, nothing stops you advertising your product "as sold to NASA", but under no circumstances could we provide any sort of endorsement or recommendation. We can co-author a paper or report with the vendor which reports information from some activity. We can say what we did, and make generalized fact based recommendations. > > Prestige does not translate into (profitable) revenue. > > > They make up this loss by selling to money-making corporations for a > > much bigger margin. > > Hmmm .... > > I think there may be a fundamental disconnect between the assumptions of > folks in academia and the reality of this particular market. I am not > bashing on Prentis. I would like to point out that the "much larger > margins" in a cutthroat business such as clusters are ... er ... not > much larger. I agree here, too. My wife works in commercial IT, and I'd say that they tend to beat the vendors down more than we do here in government funded work. She (and her bosses) have to report against monthly, quarterly, and annual targets for everything. Government work tends to be funded on an annual cycle (October 1st is the FY start date.. funding gets set based on proposals in March/April, and cast into concrete around now for the next FY). Where government work is concerned, the problems usually aren't profit margin (we're happy to give a reasonable rate of return, because screwing the clamps down onto the vendor's last penny usually doesn't work out well.. they'll put their top people on the jobs that have decent returns, especially if your job gets into difficulties). It's the myriad other weird and not so weird conditions (Drug Free Workplace Act, Buy American Act, Foreign Corrupt Practices Act, etc.) resulting from the dual role of government procurement: get something useful done and create social policy. That, and the absolute paranoia that the taxpayer might be getting the short end of the stick results in a substantially larger paperwork burden to prove they aren't. All those folks standing up decrying "waste, fraud, and abuse"... In a commercial entity, one can do a cost benefit trade on, say, inventory losses vs time/effort to keep track of things. Not in government, when someone will be sure to stand up and say "Agency X lost or misplaced 3 laptops out of the 123,000 they have, and this must stop now!" 
Government work also often pays slow, but reliably. But your bank may not understand, so getting operating capital can be a challenge. > What we are seeing is a fundamental change of business model, over to > one that keeps upfront capital costs as low as possible, and pushes > things to expense columns. Yes.. it makes *this week/month/quarter's* numbers look better, and also gets you out of having to seek capital (which has been hard recently in these odd-times for credit). > > > For example, I would not be surprised if IBM practically gave away > > RoadRunner to Los Alamos in exchange for the computing expertise at Los > > Alamos to help develop such an architecture and then be able to say that > > IBM builds the world's fastest computers (and that your company can have > > one just like it, for a price). Oh, and the users at Los Alamos probably > > provide lots of feedback to IBM which helps them build better systems in > > the future. (Don't shoot me if I'm wrong. I'm just theorizing here) > > Don't sell the folks at TJ Watson/IBM Research short. They are a very > bright group. The co-R&D elements are a way IBM can dominate the HPC > bits at the high end, and provide something that looks like an in-kind > contribution type of model so that LANL and others can go to their > granting agencies and get either more money, or fulfill specific > contract points. This sort of thing is often done under a "Cooperative Research and Development Agreement" or CRADA. This is basically a contract between the govt and the vendor which lays out who is doing what, what they're bringing to the table, and where the intellectual property rights will wind up. For example, I'm working on a space mission now where several vendors have provided equipment under CRADAs. I don't know anything about what's in the agreement, but in general, it's something along the lines of "we give you a ride into space and you get to fly your box and test your new technology" International MoUs for science instruments also work this way. For NASA this is all covered under the "Space Act" > > IBM is a business, and in most cases, won't generally have a particular > business unit make a loss for "prestige" points. That doesn't make the > board/shareholders happy. You got that right.. You'd have to put a number on the prestige and trade it against someone's advertising budget. > > > I used to work at the Princeton Plasma Physics Lab (www.pppl.gov), a > > Dept of Energy National Lab, and I can tell you many of the the systems > > sold to PPPL when I worked there was sold under a NDA, preventing anyone > > from discussing the price. Yes, the budgets are public information, but > > that tells you the computer hardware budget for a year, not how much was > > spent on each computer. > > Well part of that is to prevent shopping the quote. We see this *all* > the time. Someone asks for a quote, you provide it. They then go and > take your quote, elide specific company information, and then send it > around asking others to beat it. Exactly.. we do this with "source selection" all the time. The proposals are proprietary and confidential, the reviewers on the Source Evaluation Board all sign NDAs and go work in a special room where all the materials are kept. It's a very, very big deal (even for relatively small procurements). 
Once the selection is made, then we shred all the materials, and go to negotiate the actual contract with the vendor (which can't change substantially from what they proposed, because otherwise the losing vendors can legitimately complain). During the negotiation process (especially if it is a cost plus fixed fee) the vendor will need to provide a lot of financial details to allow our people to determine if the price is "fair" and that the vendor is giving the government the "lowest price" (if you want to stay out of trouble, do not sell us something for $1000 and then sell it to someone else for $900). That financial detail is generally proprietary (e.g. as a vendor you don't want us telling everyone how much your people are paid and how much you pay in fringe benefits) and wouldn't be disclosed, but the total contract value, and a fair amount of other information, is disclosed. Jim From prentice at ias.edu Wed Jun 30 12:43:27 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Wed, 30 Jun 2010 15:43:27 -0400 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C2B6C69.60309@scalableinformatics.com> References: <4C2B3C5D.9030405@ias.edu> <4C2B510E.5050406@scalableinformatics.com> <4C2B64BA.3000704@ias.edu> <4C2B6C69.60309@scalableinformatics.com> Message-ID: <4C2B9E5F.2010602@ias.edu> I'd like to apologize to other beowulfers for going way off-topic. This will be my last post on this topic. Joe Landman wrote: > Prentice Bisbal wrote: > >>> It is extraordinarily rare that an entity will give you permission to >>> use their name in any advertising. The most you can hope for is, >>> generally, a press release. >> >> And what would you call all the press that IBM got for Roadrunner and >> Cray got for Jaguar when those systems were at the top of the Top500? > > "extraordinarily rare" > >> Also, at SC09, the ORNL booth was showing off that they now had the top >> system, and weren't hiding the fact that it was built by Cray. I would >> call that a lot more than just a "press release". All that industry > > Again, academic/national labs gain prestige points by doing this. > Prestige rarely, if ever, turns into revenue, and even less frequently, > into profit. > >> media coverage is a lot better advertising than any single paid >> advertisement. > > Hmm .... see above. Did this media coverage inspire you to purchase a > Blue Gene? Or an XT6? > No because Roadrunner was not a Blue Gene system ;). We need to look beyond Roadrunner selling more roadrunner-like systems and Jaguar selling more Jaguars. The success of Roadrunner and Deep Blue probably didn't sell more Roadrunners and Deep Blues, but I'm sure they had an effect on IBMs stock price, and help sell lower-end IBM systems. IBM dominates the Top 500 right now. I'm sure their success with Roadrunner, Deep Blue, and Blue Gene have something to do with that. If not direct technology transfer, I bet Bob at Acme thinks to himself "IBM has done a lot of great things in supercomputing. They're definitely the experts. I think we should hire them to build and integrate a new 128-node cluster for our comp. chem group. >> >>> Prestige does not translate into (profitable) revenue. >>> >> >> I think you meant to say "Prestige does not always translate into >> profitable revenue. > > Thank you ... I agree, I implied the "extraordinarily rare" aspect of this. > > Basically prestige plus $4 + change will get you a Grande sized mocha > triple shot from Starbucks. Not much more than that. 
This isn't a > jaundiced view (ok, I hope not!) of the situation ... for-profit > companies can't afford to be paid in "prestige". > > Put it another way. Supposed the IAS's granting agencies decided that > they would provide grants that didn't cover complete costs, but provided > "prestige" for getting the grant in the first place. Even though IAS > isn't a for-profit institution per se, it still has bills to pay, and > people to employ. This sort of scenario is the case in for-profit HPC > in academia. The prestige earned doesn't pay the bills. You can't take > the prestige to the bank and earn interest on it. Its an intangible ... > good will ... is how I think it is accounted for. > > But is it valuable? Ask IBM how many more BG's the've sold as a result > of that "marketing campaign". As I said, I think there is a fundamental > disconnect between what people would like to believe, and the (rather > harsh) realities of the market. This is not a bash. This is an > observation. > >> Rolls Royce, Bentley, and Lamborghini are just a few examples of >> prestige not translating into profits. > > Bentley had been up for sale, RR is now owned by Tata, as is Jaguar. > > Prestige does not pay the bills. And if I am wrong on this, please, > educate me ... I'd like to figure out how to do this. Ask Rolex or Patek Philippe. I'm sure the only reason people drop large $$ on their watches is for the prestige of the name. (By the way - I met several Patek Philippe workers in NYC once. To them Rolex might as well be Casio - they get insulted if you compare their watches to Rolex) > >> >> Lexus, Acura, and Infiniti are all examples of prestige translating into >> huge profits. Lexus, Acura, and Infiniti autos aren't radically > > Er ... no. They are all examples of really good marketing, and not > trying to compete on price. > >> different from the Toyotas, Hondas, and Nissans they are based on, but >> the cost more, mostly because of the prestige of the upmarket name. > > Thats marketing, not prestige. When you introduce a new brand, it > doesn't have a history, and hence, no prestige. Marketing and prestige go hand in hand. Toyota's aggressive marketing of Lexus as a viable alternative to Mercedes gave it prestige. > > Marketing is what enables you to attract customers. Some customers will > be put off by price. You either lower your pricing to keep them, or you > ignore them as a market. The folks you indicated, all ignore the price > sensitive elements of the market. Which makes them vulnerable to attack > by the Hyundai's and others of the world. > >> >> >>>> They make up this loss by selling to money-making corporations for a >>>> much bigger margin. >>> Hmmm .... >>> >>> I think there may be a fundamental disconnect between the assumptions of >>> folks in academia and the reality of this particular market. I am not >>> bashing on Prentis. I would like to point out that the "much larger >>> margins" in a cutthroat business such as clusters are ... er ... not >>> much larger. >>> >>> I could bore people with anecdotes, but the fundamental take home >>> message is, if you believe this (much larger margins bit), you are >>> mistaken. >> >> Any profit is "much larger" than a loss. > > So is a $1 USD profit on a $1M USD sale a reasonable profit? And if the > loss is $100 USD on the $1M USD sale, is the $1 >> $100 ? > > No. 
> >> >>>> For example, I would not be surprised if IBM practically gave away >>>> RoadRunner to Los Alamos in exchange for the computing expertise at Los >>>> Alamos to help develop such an architecture and then be able to say >>>> that >>>> IBM builds the world's fastest computers (and that your company can >>>> have >>>> one just like it, for a price). Oh, and the users at Los Alamos >>>> probably >>>> provide lots of feedback to IBM which helps them build better >>>> systems in >>>> the future. (Don't shoot me if I'm wrong. I'm just theorizing here) >>> Don't sell the folks at TJ Watson/IBM Research short. They are a very >>> bright group. The co-R&D elements are a way IBM can dominate the HPC >>> bits at the high end, and provide something that looks like an in-kind >>> contribution type of model so that LANL and others can go to their >>> granting agencies and get either more money, or fulfill specific >>> contract points. >> >> I wasn't selling the brains IBM short. We all know that two heads are >> better then one when tackling a problem. I was saying that if IBM >> geniuses = good, and LANL geniuses = good, then IBM geniuses + LANL >> geniuses = better. > > Ahh .... ok. IBM has some really good folks at their research locations. > >> >>> IBM is a business, and in most cases, won't generally have a particular >>> business unit make a loss for "prestige" points. That doesn't make the >>> board/shareholders happy. >>> >> >> How much profit did IBM make off of Deep Blue when it beat Gary >> Kasparov? None that I know of. However, it did provide IBM R&D >> opportunities and when it finally beat Gary Kasparov, plenty of free >> advertising for IBM through news coverage, and ...prestige. > > ... which, as you note, translated to no profit. The press however, > provided them effectively free marketing. Publish this story, and we > don't have to pay for it. > > ... which is worth ... what? > > How many BG's did IBM sell as a result of the chess match? How many > people made a decision, influenced in part, by that PR and free marketing? > > Thats the point. Like it or not, prestige doesn't placate shareholders, > board members, or wall street. They want to see profits, pure and simple. See my first inline comment. Let's end this discussion, before I get the same reputation as that crazy dutch guy. -- Prentice From rpnabar at gmail.com Wed Jun 30 13:21:08 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 30 Jun 2010 15:21:08 -0500 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <20100630055026.GF28068@bx9.net> References: <20100630055026.GF28068@bx9.net> Message-ID: On Wed, Jun 30, 2010 at 12:50 AM, Greg Lindahl wrote: > Other communities with $-based metrics haven't had much success with > them. > > In HPC, many contracts are multi-year, multi-delivery, or, they > include significant extra stuff beyond the iron. Thanks Beowulfers for some interesting comments and discussion there. I guess, the conclusion (as I see it ) is : At the present time there is no good (read "easy", "direct" , "quick" etc.) way of benchmarking the $ costs of my system against others in the ecosystem. For what it's worth my own estimate of our circa. 2010 cluster is $35k/Teraflop (peak). -- Rahul From hahn at mcmaster.ca Wed Jun 30 13:45:09 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed, 30 Jun 2010 16:45:09 -0400 (EDT) Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? 
In-Reply-To: References: <20100630055026.GF28068@bx9.net> Message-ID: > For what it's worth my own estimate of our circa. 2010 cluster is > $35k/Teraflop (peak). my organization is trying to finalize a cluster that's about $CAD 30k/TF. From rchang.lists at gmail.com Wed Jun 30 18:54:54 2010 From: rchang.lists at gmail.com (Richard Chang) Date: Thu, 01 Jul 2010 07:24:54 +0530 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C2B9E5F.2010602@ias.edu> References: <4C2B3C5D.9030405@ias.edu> <4C2B510E.5050406@scalableinformatics.com> <4C2B64BA.3000704@ias.edu> <4C2B6C69.60309@scalableinformatics.com> <4C2B9E5F.2010602@ias.edu> Message-ID: <4C2BF56E.7090303@gmail.com> An HTML attachment was scrubbed... URL: From landman at scalableinformatics.com Wed Jun 30 19:41:01 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 30 Jun 2010 22:41:01 -0400 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C2BF56E.7090303@gmail.com> References: <4C2B3C5D.9030405@ias.edu> <4C2B510E.5050406@scalableinformatics.com> <4C2B64BA.3000704@ias.edu> <4C2B6C69.60309@scalableinformatics.com> <4C2B9E5F.2010602@ias.edu> <4C2BF56E.7090303@gmail.com> Message-ID: <4C2C003D.8000606@scalableinformatics.com> Richard Chang wrote: > You are right when you said that these big companies sell their stuff at > a huge discount, at least initially, is what I know. > > Here in India, where I live and work, IBM had, 3-4 yrs back, sold a BG/L > for way less than anything, virtually at the price of a normal cluster. > This was done to cut out the competition and to boast about their system > being sold, i.e., the first ever BG/L being sold in India. The > competition, as expected, was very livid that IBM could give it away at > such throwaway prices. ... we (in the business) call that "buying the business". You literally pay your customer to take your system. It doesn't take many of these to get senior execs asking where the profit is. That is, if you look at this as an investment, what is the return on this investment? My argument is that the return is nearly, if not identically, zero. My rationale for this argument comes from the fact that once a customer learns that someone else got a great deal, they also demand a similar deal. This is the segue to the NDA bit earlier. So, unless you hide the details of your sale, your margins will be impacted on nearly every sale. Does prestige translate into increased revenue? Let's ask on this list (self-selecting, probably not statistically valid, but may give a rough picture): Question for the list members who have bought (large-ish) clusters/HPC systems: Was your selection influenced by the heroic class systems sales? Did you purposefully buy from the same vendor because of this, or was this a significant contributing factor in your decision process? Feel free to answer offline and anonymously if you'd like (I'll post the question on http://scalability.org as well ... not a commercial site, no adverts there, and we already have quite a bit of daily traffic ... no astroturfing going on here). > Did IBM make a profit? I doubt it. It's another matter that this prestige > didn't give them enough mileage. It didn't start selling BG/Ls like hot > cakes. It certainly gave them a boasting ground. That's my point. Prestige doesn't normally translate into sales. Prestige gives you something to talk about, over that $4 USD cup of coffee from Starbucks.
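As a rough illustration of where dollars-per-teraflop figures like the ones quoted in this thread come from, here is a minimal back-of-the-envelope sketch; the node count, per-node specification, and price below are hypothetical placeholders, not anyone's actual system, and it covers peak (not sustained) performance only:

  #!/bin/bash
  # Hypothetical circa-2010 cluster; every number here is a placeholder.
  NODES=32
  SOCKETS=2           # sockets per node
  CORES=4             # cores per socket
  GHZ=2.66            # clock frequency in GHz
  FLOPS_PER_CLOCK=4   # double-precision FLOPs per core per cycle (SSE-era assumption)
  PRICE=120000        # total delivered price in dollars, interconnect included

  # peak TF = nodes * sockets * cores * GHz * FLOPs/cycle / 1000
  awk -v n="$NODES" -v s="$SOCKETS" -v c="$CORES" -v g="$GHZ" \
      -v f="$FLOPS_PER_CLOCK" -v p="$PRICE" 'BEGIN {
          tf = n * s * c * g * f / 1000
          printf "peak: %.2f TF, cost: $%.0f per peak TF\n", tf, p / tf
      }'

For the placeholder numbers above this works out to roughly 2.7 TF peak and about $44k per peak teraflop; swapping in sustained (Linpack) numbers, multi-year service, and facility costs moves the figure around considerably, which is much of why a clean $-per-TF league table is hard to build.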
Put another way, who won the various races over the wilderness isn't likely to influence many SUV buyers as to whether they should pick a particular brand. Prestige is a talking point ... something like "hey, did you know ..." > The subsequent quotes were very high that they couldn't win the > contracts. I was once told by a reseller that IBM's higher-ups decided > against further discounts(they will need to start making money). :-) Yeah ... this happens. If you start buying the business (won't mention any vendor names here), pretty soon you reach a point where a senior VP or the CEO looks at the profit and loss for each division/group, and notices one little one ... these HPC folks ... are bleeding capital. Unless that bleeding (also called 'investment' above in a somewhat semi-euphemistic manner) can be turned around (also called 'return on investment' above in a somewhat semi-euphemistic manner), and they can start showing a profit, that exec is going to think twice about continuing that line of business. > So, the point here is that though prestige is ! = profit, it surely > helps their reputation. Absolutely. If Prentis and his team at IAS bought a huge storage cluster at a very low margin from us, it wouldn't likely translate to a sale somewhere else, even if we could use IAS's name (we couldn't). The prestige is a badge of honor, not a sales tool. > Richard. Joe -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From landman at scalableinformatics.com Wed Jun 30 20:11:38 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 30 Jun 2010 23:11:38 -0400 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C2B9E5F.2010602@ias.edu> References: <4C2B3C5D.9030405@ias.edu> <4C2B510E.5050406@scalableinformatics.com> <4C2B64BA.3000704@ias.edu> <4C2B6C69.60309@scalableinformatics.com> <4C2B9E5F.2010602@ias.edu> Message-ID: <4C2C076A.5010703@scalableinformatics.com> Prentice Bisbal wrote: > I'd like to apologize to other beowulfers for going way off-topic. This > will be my last post on this topic. Actually given the light volume on the list, its not too bad ... and it is on topic in the business sense. At the end of the day, the fundamental question we are debating is, does the "prestige" of working with a top university/national lab have any real tangible value that you can ascribe to the bottom line, does it actually impact sales. I posit that the answer to this is a resounding "no". You obviously disagree. This is the business side of HPC. Its definitely relevant to beowulfery, which seeks to minimize cost per cycle. [...] >> Hmm .... see above. Did this media coverage inspire you to purchase a >> Blue Gene? Or an XT6? >> > > No because Roadrunner was not a Blue Gene system ;). Irrelevant to the argument. Did, the prestige of a particular system at the very high end induce you to buy a similar one? I don't think you answered affirmatively on this. > We need to look beyond Roadrunner selling more roadrunner-like systems > and Jaguar selling more Jaguars. Well, no. This is what was implied, that the prestige has follow on economic value. I posit it doesn't. 
> The success of Roadrunner and Deep Blue probably didn't sell more s/probably// > Roadrunners and Deep Blues, but I'm sure they had an effect on IBMs > stock price, and help sell lower-end IBM systems. IBM dominates the Top Well, here is where it gets murky. Can you, with any specificity, indicate what the impact upon IBM's stock price (e.g. increase in market valuation) selling a machine under its actual cost, had upon the company? I *can* point you to their bottom line and show you where that decreased by exactly the amount they may have lost in selling this machine (IBM is smart, they generally don't do business when they will lose money, they try to at least break even). You can *always* see the net impact of these sorts of "prestige" sales. Revenue increases, and profits stay flat. Like it or not, wall street punishes you when this happens. This means your gross and net margins drop. So if the stock price rose more than the net margins dropped ... then you *might* be able to ascribe value to that. The "I'm sure that..." doesn't fly here. Any argument that starts like that isn't going to win you friends in the financial community. The ones who do ascribe value to IBM's stock, and the price in the increased business risks associated with lower margins. > 500 right now. I'm sure their success with Roadrunner, Deep Blue, and > Blue Gene have something to do with that. Again, see above. This is not likely to be the case. > > If not direct technology transfer, I bet Bob at Acme thinks to himself > "IBM has done a lot of great things in supercomputing. They're > definitely the experts. I think we should hire them to build and > integrate a new 128-node cluster for our comp. chem group. We don't see that. Anyone on this list care to comment? This is a good question for the list: Did you buy an IBM because of Roadrunner? Did you buy a Cray because of Jaguar? Or are your purchase decisions largely a function of budget, suitability to purpose, technological considerations, and ... how big of a discount you got? I suspect the latter. I don't think many folks were influenced to any significant degree by the heroic class systems, other than to say "cool". -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From Daniel.Pfenniger at unige.ch Wed Jun 30 21:06:12 2010 From: Daniel.Pfenniger at unige.ch (Pfenniger Daniel) Date: Thu, 01 Jul 2010 06:06:12 +0200 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C2C003D.8000606@scalableinformatics.com> References: <4C2B3C5D.9030405@ias.edu> <4C2B510E.5050406@scalableinformatics.com> <4C2B64BA.3000704@ias.edu> <4C2B6C69.60309@scalableinformatics.com> <4C2B9E5F.2010602@ias.edu> <4C2BF56E.7090303@gmail.com> <4C2C003D.8000606@scalableinformatics.com> Message-ID: <4C2C1434.9030307@unige.ch> Joe Landman wrote: > Richard Chang wrote: > .... > Does prestige translate into increased revenue? Lets ask on this list > (self selecting, probably not statistically valid, but may give a rough > picture): In non-US wealthy countries the Top 500 list is a powerful argument to get HPC hardware from governmental funding agencies. The country is too low on the list according to the national ego? Then for sure there will be some additional money for correcting the disgrace. And then the hardware will be purchased at list price. > ... 
Dan From james.p.lux at jpl.nasa.gov Wed Jun 30 21:24:48 2010 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Wed, 30 Jun 2010 21:24:48 -0700 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C2C076A.5010703@scalableinformatics.com> Message-ID: On 6/30/10 8:11 PM, "Joe Landman" wrote: > > > Or are your purchase decisions largely a function of budget, suitability > to purpose, technological considerations, and ... how big of a discount > you got? > > I suspect the latter. I don't think many folks were influenced to any > significant degree by the heroic class systems, other than to say "cool". > I can posit that such heroic systems might convince your upper management to allow you to buy *any* cluster, especially if it's from the "itty bitty monopoly" But as Joe points out, when it actually comes to buying, the gimlet eyes of the green eyeshade brigade will be cast over the bids. It's all about cost (whether capital or life cycle). Nobody is going to pay more just because the vendor did a stunt of one sort or another. From forum.san at gmail.com Wed Jun 30 21:51:09 2010 From: forum.san at gmail.com (Sangamesh B) Date: Thu, 1 Jul 2010 10:21:09 +0530 Subject: [Beowulf] Multiple FlexLM lmgrd services on a single Linux machine? Message-ID: Dear All, We're in a process of implementing a centralized FlexLM license server for multiple commercial applications. Can some one tell us, whether Linux OS support multiple lmgrd services or not? If its not directly, is there a way to do it? For example, can we install FlexLM license servers of both ANSYS and STAR CD on a single linux server? Thank you -------------- next part -------------- An HTML attachment was scrubbed... URL: From hahn at mcmaster.ca Wed Jun 30 22:21:47 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 1 Jul 2010 01:21:47 -0400 (EDT) Subject: [Beowulf] Multiple FlexLM lmgrd services on a single Linux machine? In-Reply-To: References: Message-ID: > Linux OS support multiple lmgrd services or not? If its not directly, is > there a way to do it? I don't really understand what you're asking. yes, linux provides fully functional TCP/IP. yes, flexlm can run either with a merged license file (single base port, multiple vendor ports), or with multiple completely separate instances (listening on say, ports 27000+27001 and 28000+28001). the latter is often more convenient, since it means you can adjust one instance without affecting the other. From hahn at mcmaster.ca Wed Jun 30 23:20:41 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 1 Jul 2010 02:20:41 -0400 (EDT) Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: References: Message-ID: > (whether capital or life cycle). Nobody is going to pay more just because > the vendor did a stunt of one sort or another. I agree: vendor stunts are advertising. but think of them as like a mating display - elaborate feathers or a big rack of antlers. they make a claim of fitness that speaks mainly to a customer's risk-aversion: if IBM/Cray/etc can make some giant cluster work, then surely our little cluster project will succeed. if you have more in-house expertise, you may not value this as much. in a sense, this factor is anti-beowulf, since the expectation for really commoditized parts is that they'll Just Work. with some modest care, you can be pretty confident that the software stack will Just Work. especially with open-source, which provides greater access and fixability. 
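A minimal sketch of the second arrangement Mark describes for FlexLM (two completely separate lmgrd instances on one box); the paths, hostid, ports, and vendor daemon names below are illustrative guesses only, and the real values come from the license files each vendor issues:

  # Two independent lmgrd instances, each with its own license file,
  # port pair, and log file (all names and paths here are placeholders).
  #
  # /opt/flexlm/ansys/license.dat   (excerpt)
  #   SERVER licserver 001122334455 27000
  #   VENDOR ansyslmd port=27001
  #
  # /opt/flexlm/starcd/license.dat  (excerpt)
  #   SERVER licserver 001122334455 28000
  #   VENDOR cdlmd port=28001

  /opt/flexlm/bin/lmgrd -c /opt/flexlm/ansys/license.dat  -l /var/log/lmgrd-ansys.log
  /opt/flexlm/bin/lmgrd -c /opt/flexlm/starcd/license.dat -l /var/log/lmgrd-starcd.log

  # query each instance independently
  /opt/flexlm/bin/lmutil lmstat -a -c 27000@licserver
  /opt/flexlm/bin/lmutil lmstat -a -c 28000@licserver

Clients then point at 27000@licserver or 28000@licserver via LM_LICENSE_FILE or the vendor-specific variable, and pinning both the lmgrd and vendor-daemon ports in the SERVER/VENDOR lines keeps firewall rules simple while, as Mark notes, letting you restart one vendor's service without touching the other.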
so most of the value of brand boils down to hardware/firmware-level issues that customers are not well-equipped to deal with those, either at bid-eval time or once the deal is done. my perception, though, is that vendors try to pretend such problems don't happen, rather than bragging about how well they solve them... regards, mark hahn. From bill at cse.ucdavis.edu Wed Jun 30 23:24:00 2010 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Wed, 30 Jun 2010 23:24:00 -0700 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <20100630055026.GF28068@bx9.net> References: <20100630055026.GF28068@bx9.net> Message-ID: <4C2C3480.2000206@cse.ucdavis.edu> On 06/29/2010 10:50 PM, Greg Lindahl wrote: > On Wed, Jun 30, 2010 at 12:30:12AM -0500, Rahul Nabar wrote: > >> The Top500 list has many useful metrics but I didn't see any $$ based >> metrics there. > > Other communities with $$-based metrics haven't had much success with > them. > > In HPC, many contracts are multi-year, multi-delivery, or, they > include significant extra stuff beyond the iron. Just have the vendor provide a list of included equipment, and a price with the stipulation that anyone that wants that list of equipment gets it for exactly that price. Maybe include some low level of service like equipment replacement via return to depot for 3 years. So vendors would work out nice discounts for their favorite customers, and the customer could brag to their bosses about how much under the retail price they got. From dmitri.chubarov at gmail.com Wed Jun 30 01:23:24 2010 From: dmitri.chubarov at gmail.com (Dmitri Chubarov) Date: Wed, 30 Jun 2010 15:23:24 +0700 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <20100630055026.GF28068@bx9.net> References: <20100630055026.GF28068@bx9.net> Message-ID: Hello, This metric however misleading it might be, is sometimes used in press releases. Here is just one that I could find by googling (Google translation of a Russian original). http://translate.google.com/translate?js=y&prev=_t&hl=en&ie=UTF-8&layout=1&eotf=1&u=http%3A%2F%2Fwww.pcmag.ru%2Fnews%2Fdetail.php%3FID%3D6234&sl=ru&tl=en It says: "unique for the Russian market price / performance ratio of the complex: in the light of the full cost of the solution of 158K USD for 1 teraflop of peak performance." (The text is dated 27.02.2007) The price per teraflops is determined by the components cost, mainly the CPU and the memory, so looking at the Intel, AMD, whatever price lists per socket and subsequent extrapolation is all one needs to get this metric. Though if you would factor the infrastructure costs in, the picture gets much more complicated. --dc On Wed, Jun 30, 2010 at 12:30:12AM -0500, Rahul Nabar wrote: > The Top500 list has many useful metrics but I didn't see any $$ based > metrics there. > From bibil.thaysose at gmail.com Wed Jun 30 08:50:27 2010 From: bibil.thaysose at gmail.com (Greg Rubino) Date: Wed, 30 Jun 2010 11:50:27 -0400 Subject: [Beowulf] dollars-per-teraflop : any lists like the Top500? In-Reply-To: <4C2B64BA.3000704@ias.edu> References: <4C2B3C5D.9030405@ias.edu> <4C2B510E.5050406@scalableinformatics.com> <4C2B64BA.3000704@ias.edu> Message-ID: I have to say I partially agree with Prentice. I don't know if prestige directly translates into revenue, but if your a huge company and your platform is the first one upon which some new innovation in HPC is implemented (cutthroat or not), you have a huge opportunity on your hands. 
I guess it depends upon the terms under which you took that initial "loss" (s/loss/risk/g). >> >> Prestige does not translate into (profitable) revenue. >> > > I think you meant to say "Prestige does not always translate into > profitable revenue. > From akshar.bhosale at gmail.com Wed Jun 30 12:32:09 2010 From: akshar.bhosale at gmail.com (akshar bhosale) Date: Thu, 1 Jul 2010 01:02:09 +0530 Subject: [Beowulf] guide for pbs/torque and mpi Message-ID: hi, we want to have a good reference guide for torque(pbs),maui and mpi akshar -------------- next part -------------- An HTML attachment was scrubbed... URL:
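On the Torque/Maui/MPI guide question: the Torque and Maui administrator guides from their upstream maintainers, plus your MPI distribution's documentation, cover most of it. As a taste of what such a guide walks through, a minimal MPI job script might look like the sketch below; the queue name, resource request, and application binary are placeholders, and the exact mpirun syntax depends on which MPI stack and launcher you build:

  #!/bin/bash
  #PBS -N mpi_test
  #PBS -q batch                  # queue name is site-specific (placeholder)
  #PBS -l nodes=4:ppn=8          # 4 nodes, 8 cores per node
  #PBS -l walltime=01:00:00
  #PBS -j oe                     # merge stdout and stderr

  cd "$PBS_O_WORKDIR"            # Torque starts jobs in $HOME by default

  NP=$(wc -l < "$PBS_NODEFILE")  # total cores Torque allocated to the job

  # With an MPI built against Torque's tm interface, mpirun can discover the
  # node list itself; the explicit machinefile form is the conservative way.
  mpirun -np "$NP" -machinefile "$PBS_NODEFILE" ./my_mpi_app

Submit it with qsub, watch it with qstat, and let Maui (running alongside pbs_server) handle prioritization and backfill; the same script structure carries over between sites with only the #PBS directives changing.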