[Beowulf] dedupe filesystem

Lawrence Stewart stewart at serissa.com
Mon Jun 8 05:42:23 PDT 2009


On Jun 8, 2009, at 12:55 AM, Joe Landman wrote:

> Lawrence Stewart wrote:
>
> [...]
>
>>> Yup this is it, but on the fly is the hard part.  Doing this  
>>> comparison is computationally very expensive.  The hash  
>>> calculations are not cheap by any measure.  You most decidedly do  
>>> not wish to do this on the fly ...
>
> The assumption of a high performance disk/file system is implicit  
> here.
>
>>>
>>>> And for that, it’s all about trading between cost of storage  
>>>> space, retrieval time, and computational effort to run the  
>>>> algorithm.
>>>
>>> Exactly.
>> I think the hash calculations are pretty cheap, actually.  I just
>> timed sha1sum on a 2.4 GHz Core 2 and it runs at 148 megabytes per
>> second, on one core (from the disk cache).  That is substantially
>> faster than the disk transfer rate.  If you have a parallel
>> filesystem, you can parallelize the hashes as well.
>
> Disk transfer rates are 100-120 MB/s these days.  For high
> performance local file systems (N * disks), the data rates you
> showed for sha1 won't cut it.  Especially if they have multiple hash
> signatures computed in order to avoid hash collisions (a la MD5 et
> al).  The probability that two different hash functions both map the
> same pair of unique blocks to colliding values is somewhat smaller
> than the probability that a single hash function has two (or more)
> unique blocks that collide.  So you compute two or more hashes in
> different ways, to reduce the probability of that collision.
>
> For laughs, I just tried this on our lab server.  We have a 500 MB/s  
> file system attached to it, so we aren't so concerned about data IO  
> bottlenecks.


Actual measurements and careful analysis beat handwaving every time :-)

My assumptions were a bit different, I guess: I was still figuring
50 MB/s per spindle, and supposing that <clients> compute the hashes
rather than the servers running the disks.  If that can be arranged,
the cores available to compute hashes scale with the number of
clients.
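
For what it's worth, the measurement I had in mind looks roughly like
the following Python sketch (hashlib's SHA-1 standing in for sha1sum;
the 4K block size and loop count are my own assumptions).  The
absolute number depends on the CPU, but if the clients do the
hashing, this per-core rate multiplies by however many client cores
you have.

import hashlib
import os
import time

BLOCK = 4096              # assumed dedupe block size
N = 200000                # ~800 MB hashed in total

data = os.urandom(BLOCK)
start = time.time()
for _ in range(N):
    hashlib.sha1(data).digest()
elapsed = time.time() - start

print("SHA-1: %.0f MB/s on one core" % (N * BLOCK / elapsed / 1e6))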

In any case, I am <not> arguing for deduplication in high performance
filesystems; I do think it is a decent idea for backups, though.

Regarding hash collisions, the relevant math is the "birthday
problem", I think.  If the hash values are uniformly distributed, as
they should be, the probability of a collision rises to about 1/2
when the number of blocks reaches the square root of the size of the
value space.  So you would have roughly even odds of a collision
<somewhere> with 4 billion blocks (32 bits' worth) and 64-bit hashes.
If multiple hashes are independent, then you get to add their sizes
before taking the square root.  256-bit hashes ought to give
negligible odds of a collision up to 64 bits' worth of blocks, where
"negligible" means much smaller than other sources of permanently
lost data.
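
For anyone who wants to check the arithmetic, the usual birthday
approximation p ~= 1 - exp(-n^2 / 2^(b+1)) for n blocks and b-bit
hashes can be evaluated in a couple of lines of Python (just a sketch
of the formula, nothing filesystem-specific):

import math

def collision_prob(n_blocks, hash_bits):
    # p ~= 1 - exp(-n^2 / 2^(b+1)), the usual birthday approximation
    x = n_blocks * float(n_blocks) / 2.0 ** (hash_bits + 1)
    return -math.expm1(-x)        # = 1 - exp(-x), accurate for tiny x

print(collision_prob(2 ** 32, 64))    # ~0.39: 4 billion blocks, 64-bit hash
print(collision_prob(2 ** 64, 256))   # ~1.5e-39: effectively negligible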

However, it might make more sense from a system design perspective to
use the hash as a hint and actually compare the data.  That would
force a random block read to confirm every duplicate.  Hmm, let's
convert sequential writes into random reads...
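
By way of illustration, a toy write path might look like this (purely
hypothetical names, in-memory only): the index lookup is cheap, but
confirming a hit means reading the stored block back and comparing
bytes.

import hashlib

def write_block(store, index, data):
    # 'store' is a list of blocks, 'index' maps hash -> block address
    h = hashlib.sha1(data).digest()
    addr = index.get(h)
    if addr is not None and store[addr] == data:  # the random read + compare
        return addr                               # confirmed duplicate
    addr = len(store)                             # new (or colliding) block
    store.append(data)
    index[h] = addr
    return addr

store, index = [], {}
first = write_block(store, index, b"x" * 4096)
second = write_block(store, index, b"x" * 4096)
assert first == second and len(store) == 1        # second write deduped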

The compression ratio of 4K blocks to 16-byte hashes is also suspect:
that is 256 to one, and the incremental cost ratio of disk to RAM is
not much different.  So keeping the hashes in RAM is probably too
expensive.
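
The back-of-the-envelope version, assuming 4 KiB blocks and 16-byte
hashes:

TIB = 2 ** 40
BLOCK_BYTES = 4096
HASH_BYTES = 16

blocks_per_tib = TIB // BLOCK_BYTES          # 268,435,456 blocks
index_bytes = blocks_per_tib * HASH_BYTES    # 4 GiB of raw hashes per TiB

print(blocks_per_tib, index_bytes / 2.0 ** 30, "GiB")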

So in summary: deduplication is messy and complicated, has poor
performance, and its economics for HPC are uncertain.  Let's not do
it.

-L




