[Beowulf] dedupe filesystem
stewart at serissa.com
Fri Jun 5 12:09:40 PDT 2009
On Jun 5, 2009, at 1:12 PM, Joe Landman wrote:
> Lux, James P wrote:
> It only looks at raw blocks. If they have the same hash signatures
> (think like MD5 or SHA ... hopefully with fewer collisions), then
> they are duplicates.
>> maybe a better model is a “data compression” algorithm on the fly.
> Yup this is it, but on the fly is the hard part. Doing this
> comparison is computationally very expensive. The hash calculations
> are not cheap by any measure. You most decidedly do not wish to do
> this on the fly ...
>> And for that, it’s all about trading between cost of storage space,
>> retrieval time, and computational effort to run the algorithm.
I think the hash calculations are pretty cheap, actually. I just
timed sha1sum on a 2.4 GHz core2 and it runs at 148 Megabytes per
second, on one core (from the disk cache). That is substantially
faster than the disk transfer rate. If you have a parallel
filesystem, you can parallelize the hashes as well.
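
For anyone who wants to repeat this kind of measurement without sha1sum, here is a rough sketch using Python's standard hashlib. It is not the code behind the number above. The 4096-byte block size and the function names sha1_throughput and find_duplicate_blocks are assumptions made for illustration; the thread does not specify a block size. The second function shows the block-signature comparison discussed earlier: blocks with equal SHA-1 digests are treated as duplicates.

```python
import hashlib
import os
import time
from collections import defaultdict

BLOCK_SIZE = 4096  # assumed block size; the thread does not specify one


def sha1_throughput(data, block_size=BLOCK_SIZE):
    """Hash data block by block on one core; return megabytes per second."""
    start = time.perf_counter()
    for offset in range(0, len(data), block_size):
        hashlib.sha1(data[offset:offset + block_size]).digest()
    # Guard against a zero interval on very small inputs.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return len(data) / 1e6 / elapsed


def find_duplicate_blocks(data, block_size=BLOCK_SIZE):
    """Return {digest: [block indices]} for digests seen more than once."""
    seen = defaultdict(list)
    for index, offset in enumerate(range(0, len(data), block_size)):
        digest = hashlib.sha1(data[offset:offset + block_size]).hexdigest()
        seen[digest].append(index)
    return {d: idxs for d, idxs in seen.items() if len(idxs) > 1}


if __name__ == "__main__":
    # Two identical random blocks followed by 4 MiB of random data.
    sample = os.urandom(BLOCK_SIZE) * 2 + os.urandom(BLOCK_SIZE * 1024)
    print("SHA-1: %.0f MB/s on one core" % sha1_throughput(sample))
    print("duplicate groups: %d" % len(find_duplicate_blocks(sample)))
```

The data here is already in memory, so like the disk-cache timing above it measures only the hash cost, not disk I/O. Results will vary with the CPU and the block size chosen.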