[Beowulf] dedupe filesystem
Many of your questions may have already been answered in earlier discussions or in the FAQ. The search results page will indicate current discussions as well as past list serves, articles, and papers.
Lawrence Stewart stewart at serissa.comThu Jun 25 11:44:57 PDT 2009
- Previous message: [Beowulf] dedupe filesystem
- Next message: [Beowulf] dedupe filesystem
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Rahul Nabar wrote: > > > On Tue, Jun 2, 2009 at 12:39 PM, Ashley Pittman <ashley at pittman.co.uk > <mailto:ashley at pittman.co.uk>> wrote: > > Fdupes scans the filesystem looking for files where the size > matches, if > it does it md5's them checking for matches and if that matches it > finally does a byte-by-byte compare to be 100% sure. > > > Why is a full byte-by-byte comparison needed even after a md5 sum > matches? I know there is a vulnerability in md5 but that's more of a > security thing and by random chance super unlikely , right? > > Or, why not use another checksum that is as yet not vulnerable? SHA1? > SHA2? etc.? Or are they way too expensive to compute? > > Just curious.... > > -- > Rahul > ------------------------------------------------------------------------ > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > It is a tradeoff between the size of the hash and the amount of storage it takes. Due to the "birthday problem," when you use an N bit hash, the probability of an erroneous hash collision rises to about 1/2 when the number of blocks in the file system rises to 2**(N/2). If you have 64 bit hashes and 1024 byte blocks, for example, this happens with about 4 TB of data, which is not much. You could choose to use 128 bit hashes, and cut down this problem by another factor of 2**32, but it may work out cheaper to do the real compare and be sure, one time out of 2**32 than to double the storage for hashes. If you <are> going to depend on hash collisions meaning the data is identical, then you really need 256 bit hashes, probably, because you want the probability of an erroneous collision to be much smaller than other sources of data loss, which ought to be down in the 2**-45 area. (1e-15 ish). The security properties of the hashes are not so important, just their randomness. -Larry
- Previous message: [Beowulf] dedupe filesystem
- Next message: [Beowulf] dedupe filesystem
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Beowulf mailing list
