[Beowulf] dedupe filesystem


Lawrence Stewart stewart at serissa.com
Fri Jun 5 12:09:40 PDT 2009


On Jun 5, 2009, at 1:12 PM, Joe Landman wrote:

> Lux, James P wrote:
>
> It only looks at raw blocks.  If they have the same hash signatures  
> (think like MD5 or SHA ... hopefully with fewer collisions), then  
> they are duplicates.
>
>> maybe a better model is a  “data compression” algorithm on the fly.
>
> Yup this is it, but on the fly is the hard part.  Doing this  
> comparison is computationally very expensive.  The hash calculations  
> are not cheap by any measure.  You most decidedly do not wish to do  
> this on the fly ...
>
>> And for that, it’s all about trading between cost of storage space,  
>> retrieval time, and computational effort to run the algorithm.
>
> Exactly.
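
For readers following the quoted exchange, the block-level scheme described there (hash each raw block, treat equal hashes as duplicates) can be sketched roughly as follows. This is only an illustration: the 4096-byte block size, the in-memory dictionary used as a block store, and the names dedupe_blocks and rebuild are all assumptions made for this sketch and do not describe any particular product.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for this sketch; real systems vary


def dedupe_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into fixed-size blocks and keep each distinct block once.

    Returns (store, refs): store maps SHA-1 digest -> block bytes,
    refs is the ordered list of digests needed to rebuild the data.
    """
    store = {}
    refs = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha1(block).digest()
        # Equal digests are treated as equal content here. A cautious
        # design would also compare the bytes to rule out a collision.
        store.setdefault(digest, block)
        refs.append(digest)
    return store, refs


def rebuild(store, refs) -> bytes:
    """Reassemble the original data from the block store and digest list."""
    return b"".join(store[d] for d in refs)


if __name__ == "__main__":
    data = b"A" * 4096 * 3 + b"B" * 4096 + b"A" * 4096
    store, refs = dedupe_blocks(data)
    print(len(refs), "blocks referenced,", len(store), "blocks stored")
    assert rebuild(store, refs) == data
```

With the sample data above, five blocks are referenced but only two are stored, since four of the blocks have identical contents.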


I think the hash calculations are pretty cheap, actually.  I just  
timed sha1sum on a 2.4 GHz core2 and it runs at 148 Megabytes per  
second, on one core (from the disk cache).  That is substantially  
faster than the disk transfer rate.  If you have a parallel  
filesystem, you can parallelize the hashes as well.
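
A measurement along these lines can be repeated without sha1sum. The sketch below is not the command that was timed above; it is a rough Python equivalent that hashes an in-memory buffer on one core, so disk speed does not enter into it. The function name sha1_throughput and the 64 MiB / 1 MiB sizes are arbitrary choices for illustration, and the figure it prints will depend on the machine.

```python
import hashlib
import os
import time


def sha1_throughput(total_bytes: int = 64 * 1024 * 1024,
                    chunk_size: int = 1024 * 1024) -> float:
    """Return single-core SHA-1 throughput in megabytes per second,
    measured by hashing an in-memory buffer repeatedly."""
    chunk = os.urandom(chunk_size)  # random data, generated before timing
    h = hashlib.sha1()
    hashed = 0
    start = time.perf_counter()
    while hashed < total_bytes:
        h.update(chunk)
        hashed += chunk_size
    elapsed = time.perf_counter() - start
    return hashed / elapsed / 1e6


if __name__ == "__main__":
    print(f"SHA-1: {sha1_throughput():.0f} MB/s on one core")
```

Comparing the printed rate against the sustained transfer rate of the disks in question gives a rough idea of whether hashing or I/O would be the bottleneck.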

-L




