[Beowulf] Rsync - checksums

Bill Wichser bill at princeton.edu
Tue Jun 18 03:59:46 PDT 2019


Just for clarity here, we are NOT using the -c option.  The checksums 
happen whenever there is a transfer between the rsync source and the 
rsyncd on the other end.

AFM does not work in our case.  Our filesets are NOT independent and it 
would take months to make them so.  We do not have enough spare disk to 
complete a few of those anyway.  But the most important piece is that 
IBM have limits on an AFM fileset which we already exceed.

Gathering data at the source of the filesystem is the ultimate goal.  We 
are not going back to DMAPI, though, but will wait until IBM's new method 
is fully baked.  Regardless of this, we need a full backup to start.

This is not some trivial rsync running at the top level.  There is code 
we wrote as well as integration with Jenkins.  When we recompiled rsync 
to use MD4 instead of MD5, we saw a 20% increase in performance across 
the board.  This is what sparked my question.
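
For anyone who wants a rough feel for the raw hashing cost on their own hardware, here is a minimal sketch in Python.  It is not how we measured the 20% figure, and the numbers will not match rsync's own C implementations.  The function names and the 8 MiB buffer size are made up for illustration, and MD4 is reported as None when the local OpenSSL build does not provide it.

```python
import hashlib
import time


def hash_throughput(name, data, repeats=5):
    """Return approximate MB/s for hashing `data` `repeats` times with `name`."""
    start = time.perf_counter()
    for _ in range(repeats):
        h = hashlib.new(name)  # raises ValueError if the algorithm is unavailable
        h.update(data)
        h.digest()
    elapsed = time.perf_counter() - start
    return (len(data) * repeats) / elapsed / 1e6


def compare(algorithms=("md4", "md5"), size=8 * 1024 * 1024):
    """Hash one in-memory buffer with each algorithm; None means unavailable."""
    data = bytes(size)  # zero-filled buffer, so no disk I/O is measured
    results = {}
    for name in algorithms:
        try:
            results[name] = hash_throughput(name, data)
        except ValueError:
            results[name] = None
    return results


if __name__ == "__main__":
    for name, rate in compare().items():
        print(name, "unavailable" if rate is None else f"{rate:.0f} MB/s")
```

This only times the digest over memory, so it says nothing about the open+read cost on a real filesystem.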

Bill

On 6/18/2019 5:02 AM, Peter Kjellström wrote:
> On Mon, 17 Jun 2019 08:29:53 -0700
> Christopher Samuel <chris at csamuel.org> wrote:
> 
>> On 6/17/19 6:43 AM, Bill Wichser wrote:
>>
>>> md5 checksums take a lot of compute time with huge files and even
>>> with millions of smaller ones.  The bulk of the time for running
>>> rsync is spent in computing the source and destination checksums
>>> and we'd like to alleviate that pain of a cryptographic algorithm.
>>
>> First of all I would note that rsync only uses checksums if you tell
>> it to, otherwise it just uses file times and sizes to determine what
>> to transfer.
> 
> As Chris says, rsync decides if a file needs to be synced based on the
> content of the file (by hashing it on both source and destination side).
> 
> It does _NOT_ protect the transfer with said checksum nor does it
> verify the destination side write with it.
> 
> In the end the (significant) performance cost of using -c boils down to
> the cost of doing open+read of each file on both source and destination
> side (instead of just stat). The hashing algo is not the main problem.
> 
> /Peter
> 
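
The two kinds of "has this file changed?" test discussed in the quoted thread can be sketched roughly as follows.  This is an illustration in Python, not rsync's actual code: the function names are hypothetical, MD5 is just one possible digest, and the real tool has more cases (modify-window handling, for instance).  The first check needs only stat(); the second has to open and read both files in full.

```python
import hashlib
import os


def quick_check_differs(src, dst):
    """Compare size and whole-second mtime using stat() only; no file reads."""
    a, b = os.stat(src), os.stat(dst)
    return a.st_size != b.st_size or int(a.st_mtime) != int(b.st_mtime)


def file_digest(path, chunk=1 << 20):
    """Read the whole file in chunks and return its MD5 digest."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.digest()


def checksum_differs(src, dst):
    """Compare by content: a size check first, then a full read of both files."""
    if os.stat(src).st_size != os.stat(dst).st_size:
        return True
    return file_digest(src) != file_digest(dst)
```

A file whose content changed while its size and mtime stayed the same is missed by the quick check and caught by the checksum comparison, at the price of reading every byte on both sides.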


