[Beowulf] Torrents for HPC

Ellis H. Wilson III ellis at cse.psu.edu
Wed Jun 13 05:07:07 PDT 2012

On 06/13/12 11:43, Peter wrote:
> I read the initial Q that the full data set may be required by any job
> so an upgrade to my personal filters may be required :). If this were

No, you are correct about that, or at least, that's what I understood it 
to mean as well.  So for instance, Job1 has Task1-30 and the 30GB 
DataSet has Chunk1-30, each 1GB in size, spread over the entire cluster. 
  Hadoop just matches each task to the chunk it wants to work on.  Yes, this 
means at least parts of the process must be embarrassingly 
parallel, but that's pretty much taken for granted with big data 
computation.  The serial parts are typically handled by the shuffle and 
reduce phases at the end.
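
To make the task-to-chunk matching concrete, here is a rough sketch of 
locality-aware assignment.  It is not Hadoop's actual scheduler (which is 
heartbeat-driven and far more involved).  The function names, the random 
replica placement, the one-slot-per-node default, and the 30-node cluster 
size are all assumptions made for illustration.

```python
import random

def place_replicas(num_chunks, num_nodes, replication=3, seed=0):
    """Hypothetical placement: each chunk gets `replication` distinct random nodes."""
    rng = random.Random(seed)
    return {
        chunk: set(rng.sample(range(num_nodes), replication))
        for chunk in range(num_chunks)
    }

def assign_tasks(placement, num_nodes, slots_per_node=1):
    """Greedily give each chunk's task a node holding that chunk, if one is free.

    Assumes num_nodes * slots_per_node >= number of chunks, so that a free
    slot always exists somewhere in the cluster.
    Returns {chunk: (node, is_local)}.
    """
    free = {node: slots_per_node for node in range(num_nodes)}
    assignment = {}
    for chunk, holders in placement.items():
        local = [n for n in sorted(holders) if free[n] > 0]
        if local:
            node, is_local = local[0], True
        else:
            # No replica holder has a free slot: run remotely and stream the data.
            node = next(n for n in range(num_nodes) if free[n] > 0)
            is_local = False
        free[node] -= 1
        assignment[chunk] = (node, is_local)
    return assignment

if __name__ == "__main__":
    placement = place_replicas(num_chunks=30, num_nodes=30)
    assignment = assign_tasks(placement, num_nodes=30)
    remote = sum(1 for _, is_local in assignment.values() if not is_local)
    print(f"{len(assignment) - remote} local tasks, {remote} remote tasks")
```

Even this naive greedy pass tends to find a local replica for most tasks, 
and the unlucky ones simply read their chunk over the network.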

> Given that 30-60Gb is small enough copy everywhere, that sort of takes

I wouldn't expect much performance improvement going from 3 to all 30 
chunks on a given node, unless you are incredibly unlucky or something 
is terribly misconfigured with your Hadoop instance.  While 30GB isn't 
too bad to copy elsewhere, it's an incredibly poor use of storage 
resources to have 30 copies of the data all over.
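
As a back-of-the-envelope check on the storage cost, the arithmetic can be 
sketched as below.  The 30-node cluster size is my assumption (one full 
copy per node gives the 30 copies above), and the helper names are made up 
for the example.

```python
def cluster_footprint_gb(dataset_gb, copies):
    """Total raw storage consumed when `copies` full copies of the data exist."""
    return dataset_gb * copies

def avg_chunks_per_node(num_chunks, replication, num_nodes):
    """Average number of chunk replicas a node holds under even placement."""
    return num_chunks * replication / num_nodes

DATASET_GB = 30   # 30 chunks of 1GB each, as in the example above
NUM_CHUNKS = 30
NUM_NODES = 30    # assumption: one node per full copy
REPLICATION = 3   # 3 copies of each chunk

copy_everywhere_gb = cluster_footprint_gb(DATASET_GB, NUM_NODES)
replicated_gb = cluster_footprint_gb(DATASET_GB, REPLICATION)
chunks_per_node = avg_chunks_per_node(NUM_CHUNKS, REPLICATION, NUM_NODES)

print(f"copy everywhere: {copy_everywhere_gb} GB")       # 900 GB
print(f"3x replication:  {replicated_gb} GB")            # 90 GB
print(f"avg chunks held per node: {chunks_per_node}")    # 3.0
```

Under those assumptions, copying everywhere costs ten times the raw 
storage of 3x replication, for a dataset that mostly gets read locally 
either way.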

> The comment regarding the obscuring the replication process was directed
> more towards the user experience, they don't need to know it
> automagically happens BUT behind the scenes the copies are happening all
> the same, with the expected impact incurred on IO etc. So HDFS doesn't
> make the process impact free.

Making 30 copies of a 30GB dataset composed of 30 1GB files is quite 
different from keeping 3 copies of each file, both in size and in the 
work passed on to the user to manage.  Even if you get unlucky and one of 
your tasks does require remote data, Hadoop handles streaming it to the 
task for as long as it is needed and cleans up afterwards.  It's going to 
be far more 
considerate about storage resources than any human being will be.

> If you are able to send more to the list regarding HDFS plan B that
> would be great and certainly something I'd be interested in hearing more
> about. Do you have a blog or similar with references regarding any of
> the above ? If so that would be much appreciated.

Not yet.  Working on a website as well -- will let you know as soon as 
that completes.


