[Beowulf] dedupe filesystem

Ashley Pittman ashley at pittman.co.uk
Tue Jun 2 10:39:40 PDT 2009


On Tue, 2009-06-02 at 12:34 -0400, Michael Di Domenico wrote:
> does anyone have an opinion on dedup'ing files on a filesystem, but
> not in the context of backups?  I did a google search for a program,
> but only seemed to find the big players in the context of backups and
> block levels.  i just need a file level check and report.

I'm not sure I understand the question, but if it's a case of looking for
duplicate files on a filesystem, I use fdupes:

http://premium.caribe.net/~adrian2/fdupes.html

> Is scanning the filesystem and md5'ing the files really the best (or
> only) way to do this?

Fdupes scans the filesystem looking for files whose sizes match.  For
those it computes md5 sums, and where the sums also match it finally does
a byte-by-byte compare to be 100% sure.  As a result it can take a while
on filesystems with lots of duplicate files.
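
To illustrate the three stages (size, then md5, then byte-by-byte), here
is a rough Python sketch.  It is not fdupes's own code (fdupes is written
in C); the names md5_of and find_duplicates are made up for this example,
and it assumes you only care about regular files, not symlinks.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def md5_of(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return lists of paths under root whose contents are identical."""
    # Stage 1: group regular files by size (cheap, needs only a stat).
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    duplicate_sets = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate
        # Stage 2: within one size group, group by md5 sum.
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[md5_of(path)].append(path)
        for same_hash in by_hash.values():
            # Stage 3: confirm with a full byte-by-byte comparison.
            confirmed = []
            for path in same_hash:
                for group in confirmed:
                    if filecmp.cmp(group[0], path, shallow=False):
                        group.append(path)
                        break
                else:
                    confirmed.append([path])
            duplicate_sets.extend(g for g in confirmed if len(g) > 1)
    return duplicate_sets
```

Calling find_duplicates("/some/dir") gives back a list of lists, one per
set of identical files, which is enough for a file-level check and report.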

There is another test it could do after checking the sizes and before
the full md5: it could compare the first, say, KB of each file, which
should make it run quicker in cases where lots of files match in size
but not in content.  But anyway, I digress.
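
Something like the following Python sketch is what I have in mind for
that extra test.  It is only an illustration, not anything fdupes
provides: read_prefix, split_by_prefix and the 1024-byte PREFIX_BYTES are
names and values picked for the example.

```python
from collections import defaultdict

PREFIX_BYTES = 1024  # assumed prefix length; "the first, say, KB"


def read_prefix(path, length=PREFIX_BYTES):
    """Return up to the first `length` bytes of a file."""
    with open(path, "rb") as handle:
        return handle.read(length)


def split_by_prefix(paths, length=PREFIX_BYTES):
    """Split same-size candidates into groups sharing an identical prefix.

    Files whose prefix is unique are dropped, since they cannot be
    duplicates of anything else in the list.
    """
    groups = defaultdict(list)
    for path in paths:
        groups[read_prefix(path, length)].append(path)
    return [group for group in groups.values() if len(group) > 1]
```

Only the groups that come back would then need the full md5, so a large
file that differs from its same-size neighbours near the start costs one
short read instead of a complete pass over its contents.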

Ashley Pittman.



