[Beowulf] dedupe filesystem

Ashley Pittman ashley at pittman.co.uk
Sun Jun 28 10:19:50 PDT 2009


On Thu, 2009-06-25 at 13:09 -0500, Rahul Nabar wrote:

> On Tue, Jun 2, 2009 at 12:39 PM, Ashley Pittman <ashley at pittman.co.uk>
> wrote:
>         Fdupes scans the filesystem looking for files where the size
>         matches, if it does it md5's them checking for matches and if
>         that matches it finally does a byte-by-byte compare to be 100%
>         sure.
> 
> Why is a full byte-by-byte comparison needed even after an md5 sum
> matches? I know there is a vulnerability in md5 but that's more of a
> security thing and by random chance super unlikely, right?

> Just curious....

Checksums are an (inherently imperfect) way of checking that two files
aren't different; they are not intended to, and cannot, prove that two
files are the same.
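
To make the point concrete, here is a small Python sketch. The 8-bit
toy_checksum below is a hypothetical stand-in, not md5; the same argument
applies to any fixed-size checksum, only with much longer odds, because
there are more possible files than possible checksum values.

```python
def toy_checksum(data: bytes) -> int:
    """Hypothetical 8-bit checksum: sum of all bytes, modulo 256."""
    return sum(data) % 256


a = b"ab"
b = b"ba"

# Different checksums would prove the inputs differ.
# Equal checksums prove nothing: these two inputs are not the same.
print(toy_checksum(a) == toy_checksum(b))  # True
print(a == b)                              # False
```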

If you relied on the md5 sum alone, any collision would make two
different files look identical, and deduplicating them would lose the
data in one of them.
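
For illustration, the three-stage check described in the quoted text
(size, then md5, then byte-by-byte) might look roughly like this in
Python. This is a sketch written from that description only, not taken
from the fdupes source; the names md5_of and find_duplicates are made up
for the example, and symlinks are skipped as a simplifying assumption.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def md5_of(path, chunk_size=1 << 20):
    """Return the md5 hex digest of a file, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return groups (lists of paths) of files with identical content."""
    # Stage 1: only files of equal size can possibly be identical.
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        # Stage 2: md5 cheaply rules out most same-size non-duplicates.
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[md5_of(path)].append(path)
        for candidates in by_hash.values():
            # Stage 3: a byte-by-byte compare is the only step that
            # actually establishes that two files are the same.
            confirmed = []
            for path in candidates:
                for group in confirmed:
                    if filecmp.cmp(group[0], path, shallow=False):
                        group.append(path)
                        break
                else:
                    confirmed.append([path])
            groups.extend(g for g in confirmed if len(g) > 1)
    return groups
```

Calling find_duplicates("/some/dir") returns a list of lists of paths.
The first two stages only narrow down the candidates; nothing is
reported as a duplicate until the final compare has confirmed it.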


-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



