[Beowulf] Re: dedupe Filesystem

Lawrence Stewart stewart at serissa.com
Wed Jun 3 05:24:19 PDT 2009


I know a little bit about this from a time before SiCortex.

The big push for deduplication came from disk-to-disk backup  
companies.  As you can imagine, there is a huge advantage for  
deduplication if the problem you are trying to solve is backing up a  
thousand desktops.

For that purpose, whole file duplicate detection works great.
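
In case it's not obvious, the whole-file case is just a hash table
keyed by a strong digest of each file's contents.  A rough Python
sketch (the SHA-256 choice and the names here are mine, not any
particular product's) looks something like this:

    import hashlib
    import os

    def dedupe_whole_files(root):
        # Group files under 'root' by content digest; every path that
        # shares a digest is a duplicate, so one copy gets stored and
        # the rest become references to it.
        by_digest = {}
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                h = hashlib.sha256()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        h.update(chunk)
                by_digest.setdefault(h.hexdigest(), []).append(path)
        return by_digest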

The next big problem is handling incremental backups.  Making them run  
fast is important.  And some applications, um, Outlook, have huge  
files (PST files) that change in minor ways every time you touch them.  
The big win here is the ability to detect and handle duplication at  
the block or sub-block level. This can have enormous performance  
advantages for incremental backups of those 1000 huge PST files.
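
The block-level version stores each block once, keyed by its digest,
and keeps a per-file "recipe" of block keys.  A toy sketch with
fixed-size blocks (the 4 KB size and the names are illustrative):

    import hashlib

    BLOCK_SIZE = 4096  # illustrative; real products pick their own

    def store_blocks(path, block_store):
        # Split the file into fixed-size blocks, store each block
        # once under its digest, and return the file's "recipe" of
        # block keys.  Blocks already in the store cost nothing.
        recipe = []
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(BLOCK_SIZE), b""):
                key = hashlib.sha256(block).hexdigest()
                block_store.setdefault(key, block)
                recipe.append(key)
        return recipe

The catch with fixed-size blocks is that one inserted byte shifts
every later block boundary, so nothing downstream matches anymore.
That is exactly what the rolling-hash schemes below fix.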

The technology for detecting sub-block level duplication is called
"shingling" or "rolling hashes"; the fingerprinting trick goes back
to Rabin (big surprise I guess!) and the shingling work to Andrei
Broder and Mark Manasse.  It is wicked clever stuff.
The same schemes are used now for finding plagiarism among pages on  
the internet.
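
Here is a toy version of the chunking idea, using a simple polynomial
rolling hash rather than Rabin's actual fingerprint over an
irreducible polynomial (the window size and mask are made-up
parameters).  A chunk boundary is declared wherever the low bits of
the window hash are zero, so boundaries follow the content rather
than byte offsets, and an insertion only disturbs the chunks around
it:

    def chunk_boundaries(data, window=48, mask=(1 << 13) - 1):
        # Slide a fixed-width window over 'data' keeping a rolling
        # hash.  Emit a boundary whenever the hash's low bits are all
        # zero; on average that gives chunks of about mask+1 bytes.
        BASE = 257
        MOD = (1 << 61) - 1
        pow_out = pow(BASE, window - 1, MOD)  # weight of outgoing byte
        h = 0
        boundaries = []
        for i, b in enumerate(data):
            if i >= window:
                h = (h - data[i - window] * pow_out) % MOD
            h = (h * BASE + b) % MOD
            if i + 1 >= window and (h & mask) == 0:
                boundaries.append(i + 1)
        return boundaries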

I probably don't need to remind anyone here that deduplication on a
live filesystem (as opposed to backups) can have really bad
performance effects.  Imagine having to move the disk arms for every
block of every file.  Modern filesystems do well at keeping files
contiguous and often keep all the files of a directory nearby.  That
locality gets trashed by deduplication.  This won't matter if the
problem is making backups smaller or making incrementals run faster,
but it is not good for the performance of a live filesystem.

-Larry/thinking about what to do next



