[Beowulf] dedupe filesystem
landman at scalableinformatics.com
Wed Jun 3 06:18:19 PDT 2009
It might be worth noting that dedup is not intended for high-performance
file systems ... the cost of computing the hashes is huge. Dedup is
used *primarily* to prevent filling up expensive file systems of
limited size (e.g. SAN units with "fast" disks). For that crowd,
20-30 TB is a huge, and very expensive, system. Dedup (in theory) gives
these file systems greater storage density, and also allows for
faster DR, faster backup, and whatnot else ... assuming that dedup is
meaningful for the files stored.
It's fine for slower directories, but the costs of dedup usually involve
a hardware or software layer which isn't cheap.
Arguably, Dedup is more of a tactical effort on the part of the big
storage vendors to reduce the outflow of their customers to less
expensive storage modalities and products. It works well in some
specific cases (with lots of replication), and poorly in many others.
Think of trying to zip up a binary file with very little in the way of
repeating patterns. Dedup is roughly akin to RLE encoding, with a
shared database of blocks, using hash keys to represent specific blocks.
If your data has lots of identical blocks, then dedup can save
you lots of space: point each such block to the dictionary with its
hash key, and when you read that block, pull it from the dictionary.
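The dictionary scheme above can be sketched in a few lines. This is a toy illustration of the idea, not any vendor's actual implementation; the block size, the choice of SHA-256, and the function names are my own assumptions:

```python
# Toy block-level dedup sketch (hypothetical, illustrative only):
# fixed-size blocks, hash keys, and a shared block dictionary.
import hashlib

BLOCK_SIZE = 4096  # assumed block size

def dedup_store(data: bytes, dictionary: dict) -> list:
    """Split data into blocks; store each unique block once, keyed by
    its hash. The 'file' becomes just a list of hash keys."""
    keys = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        key = hashlib.sha256(block).hexdigest()
        dictionary.setdefault(key, block)  # store block only if unseen
        keys.append(key)
    return keys

def dedup_read(keys: list, dictionary: dict) -> bytes:
    """Reassemble a file by pulling each block from the dictionary."""
    return b"".join(dictionary[k] for k in keys)

# Two "files" that share most of their blocks dedup well: the 8 KB of
# identical leading data is stored in the dictionary only once.
shared = {}
file_a = b"A" * 8192 + b"unique tail"
file_b = b"A" * 8192 + b"other tail!"
keys_a = dedup_store(file_a, shared)
keys_b = dedup_store(file_b, shared)
```

With data this repetitive the dictionary holds far fewer blocks than the two files reference; with dissimilar data you pay the hashing cost and save nothing.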
This is how many of the backup folks get their claimed 99% compression, BTW.
They don't get this in general for a random collection of different
files; they get it for files that dedup software can compress well.
Another technique is storing the original and diffs, or the current
version and backward diffs. So if you have a block that differs in two
characters, point to the original and store a diff.
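A minimal sketch of that diff idea, again my own illustration rather than any product's on-disk format (equal-length blocks assumed for simplicity):

```python
# Toy diff-based storage sketch: keep the original block, and for a
# near-duplicate store only the (offset, byte) pairs that differ.
def make_diff(original: bytes, variant: bytes) -> list:
    """List (offset, byte) pairs where variant differs from original
    (blocks of equal length assumed)."""
    return [(i, variant[i]) for i in range(len(original))
            if original[i] != variant[i]]

def apply_diff(original: bytes, diff: list) -> bytes:
    """Reconstruct the variant from the original plus its diff."""
    buf = bytearray(original)
    for offset, value in diff:
        buf[offset] = value
    return bytes(buf)

# A 4 KB block differing in two bytes stores as two (offset, byte)
# pairs instead of a second 4 KB copy.
original = b"X" * 4096
variant = bytearray(original)
variant[10], variant[100] = ord("a"), ord("b")
variant = bytes(variant)
diff = make_diff(original, variant)
```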
The problem with this (TANSTAAFL) is that your dictionary (the
hash->block lookup) becomes your bottleneck (you probably want a real
database for this), and it can fail spectacularly in the face of a) high
data rates, b) minimal file similarity, or c) many small operations on files.
If you have Dedup anywhere, in your backup would be good.
Just my $0.02.
Michael Di Domenico wrote:
> On Tue, Jun 2, 2009 at 1:39 PM, Ashley Pittman <ashley at pittman.co.uk> wrote:
>> I'm not sure I understand the question, if it's a case of looking for
>> duplicate files on a filesystem I use fdupes
> Fdupes is indeed the type of app I was looking for. I did run into
> one catch with it, though: on first run it trounced down into a NetApp
> snapshot directory. Dupes galore...
> It would be nice if it kept a log too, so that if the files are the
> same on a second go-around it didn't have to md5 every file all over
> again.
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Joseph Landman, Ph.D
Founder and CEO
email: landman at scalableinformatics.com
web : http://scalableinformatics.com
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615