How to mirror a harddisk for backup purpose

Robert G. Brown rgb at phy.duke.edu
Fri Jun 22 12:58:43 PDT 2001


On Fri, 22 Jun 2001, David Vos wrote:

> On Fri, 22 Jun 2001, David Vos wrote:
> > The problem with tar is that you need disk space to store the intermediate
> > file between your backup and restore.
>
> Correction.  You can use tar without an intermediate file.  I was thinking
> of a different situation I was messing with yesterday.  Oops.

Since we are giving recipes, let me give one for a tarpipe.  This one
presumes that source directory and target directory exist and are
mounted on the same system (although one could be NFS mounted).  There
are also recipes that will run over a network.

To copy the contents of one directory (let's say /var) to another (let's
say /var_new), as root:

cd /var
tar cplsSf - . | (cd /var_new && tar xpf -)

If you like to watch, add a v to xpf (to see the files being unpacked).
This slows it down a bit.  The flags stand for c(reate), p(reserve
permissions), l(ocal filesystem only), s(ame order), S(parse files
efficiently), f(ile to write to is) - (stdout).  The unpack command is
run in a subshell, hence the (), and the && keeps tar from unpacking
into the wrong directory if the cd fails.  This will not copy files in
e.g. /var/spool/mail if it is an NFS mount -- if you want them copied
you'd need to remove the l option.  (Newer GNU tar releases changed l to
mean --check-links; there, spell it --one-file-system instead.)
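
One of the network recipes mentioned above can be sketched by putting a
remote shell in the middle of the pipe.  This is only a sketch under
stated assumptions: the function name tarpipe_to_host and the
TARPIPE_RSH variable are made up for illustration, the target directory
is assumed to exist already on the remote host, and the l, s and S flags
are left out to keep it portable.

```shell
#!/bin/sh
# tarpipe_to_host SRC_DIR HOST DST_DIR
# Hypothetical helper: stream SRC_DIR through a remote shell and unpack
# it in DST_DIR on HOST.  DST_DIR must already exist there.
# TARPIPE_RSH is an illustrative variable naming the remote shell
# (default: ssh).
tarpipe_to_host() {
    src=$1
    host=$2
    dst=$3
    # The subshell keeps the cd from leaking into the caller; && stops
    # tar from running in the wrong directory if the cd fails.
    # DST_DIR is wrapped in single quotes for the remote shell, so this
    # sketch breaks on paths that themselves contain a single quote.
    (cd "$src" && tar cpf - .) |
        ${TARPIPE_RSH:-ssh} "$host" "cd '$dst' && tar xpf -"
}
```

Usage would look something like "tarpipe_to_host /var backuphost
/var_new", where backuphost is a placeholder host name.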

Another tool that I haven't heard mentioned for keeping filesystems in
sync that is MUCH more efficient than dd or tar is rsync.  If your
purpose is to maintain a reasonably accurate archival mirror of a key
work directory, especially over the network, rsync is a great choice.
It will run on top of your choice of remote shell, rsh if your site is
low security or (preferred, in my opinion) ssh.

rsync makes it very easy to maintain absolutely identical directory
structures (not partitions or filesystems per se) with minimum effort.
One recipe for this sort of function goes into my "synccvs" script,
which I use to keep CVS repositories sync'd across several platforms I
work on in different networks, e.g. my home network, my laptop, the
physics department network.  By using this script, I can easily pop an
exact copy of a working CVS repository on my workstation at Duke onto my
laptop before a trip, work on the project all I want on the trip
(checking it in as needed) and then pop an exact copy of the repository
back from my laptop to all my other CVSROOTs when I get back.

The script is pretty trivial:

#!/bin/sh

# Correct command-line invocation usage:
Usage="Usage: `basename $0` cvs_pkg cvshost"

# Usage fragment
if [ $# -ne 2 ]
then
	echo "$Usage" >&2
	exit 1
fi

CVS_PKG=$1
CVS_HOST=$2

# Run rsync on top of ssh rather than rsh.
RSYNC_RSH=ssh
export RSYNC_RSH
echo "Synchronizing package $CVS_PKG with host $CVS_HOST at `date`"
# \$CVSROOT is escaped so the local shell leaves it for the remote side.
rsync -avz --delete "$CVSROOT/$CVS_PKG" "$CVS_HOST:\$CVSROOT"

(note that I'm too lazy to even do a proper job of parsing the command
line) and the recipe for keeping pretty much arbitrary directory
structures sync'd is obvious.  The only trickery is to NOT rsync
$CVSROOT to $CVSROOT, or you'll end up with e.g.
/home/rgb/Src/CVSROOT/CVSROOT -- given a source path with no trailing
slash, rsync copies the directory INTO the target directory, not onto
it.
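
The into-versus-onto behaviour is easy to see in a scratch directory.
This is a minimal sketch assuming rsync is installed locally; the
directory names (src, into, onto) are invented for the demonstration.

```shell
#!/bin/sh
# Scratch directory so nothing real is touched.
demo=$(mktemp -d)
mkdir -p "$demo/src" "$demo/into" "$demo/onto"
echo "data" > "$demo/src/file.txt"

# No trailing slash on the source: the directory itself is copied INTO
# the target, giving $demo/into/src/file.txt.
rsync -a "$demo/src" "$demo/into"

# Trailing slash on the source: only the contents are copied, so the
# target mirrors the source directly ($demo/onto/file.txt).
rsync -a "$demo/src/" "$demo/onto"

# List what ended up where.
find "$demo/into" "$demo/onto" -type f
```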

When rsync runs it starts by doing full directory listings of source and
target, checks to identify the files it needs to actually update, and
only updates those files.  So if you've altered only three files (and
the other 4257 files, occupying 500 MB, are untouched) it only sends
three files, instead of all 4260 as tar would or every byte in the
partition as dd would.  Sending the stat information is
of course generally MUCH cheaper than sending the data -- rsync will
sync quite large directory structures in a few seconds to a few minutes.
It also transparently and automagically compresses and decompresses
files (with the z flag) if doing so makes sense for your network.  Since
I often rsync through a DSL connection, it makes sense for me.  Over
100BT or better it might not, although it probably doesn't really
matter as syncing is pretty fast regardless.

The --delete flag tells it to delete any files in the target that no
longer exist in the source.  Note that tar (as far as I know) does NOT
remove existing files that do not conflict with stuff in the archive.
In the tarpipe example above, if /var_new/JUNK already existed (but not
/var/JUNK) you would probably still find /var_new/JUNK there after the
tarpipe completed.  If you are not careful, using a tarpipe to mirror a
rapidly changing directory structure will end up with a mirror
consisting of the union of all files and directories that ever existed
in the source directory.  To avoid this, you have to delete all the
files in the target before beginning.  This leaves a small window when
your mirror doesn't exist and a crash in the primary will lose the
directory.  I therefore think of rsync as much smarter and much safer
(there are LOTS of options for rsync and it works transparently over
the network).
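
The stale-file problem with a tarpipe can be reproduced in a few lines.
This is a sketch in a scratch directory; the tarpipe helper function
and the file names are invented for illustration.

```shell
#!/bin/sh
demo=$(mktemp -d)
mkdir -p "$demo/src" "$demo/mirror"
echo "keep" > "$demo/src/keep.txt"
echo "junk" > "$demo/src/JUNK"

# Small helper (the name is illustrative) wrapping the tarpipe.
tarpipe() {
    (cd "$1" && tar cpf - .) | (cd "$2" && tar xpf -)
}

# Initial mirror: both files are copied.
tarpipe "$demo/src" "$demo/mirror"

# JUNK disappears from the source, then the mirror is refreshed.
rm "$demo/src/JUNK"
tarpipe "$demo/src" "$demo/mirror"

# tar only adds and overwrites, so the stale copy is still listed here.
ls "$demo/mirror"
```

Running rsync -a --delete "$demo/src/" "$demo/mirror/" afterwards
should remove the leftover JUNK without emptying the mirror first.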

Both tar and rsync require that a filesystem already exist on the target
disk.  dd does not.  For example, you can do a poor man's copy of a
CD-ROM by reading the raw /dev/cdrom into disk_image.iso and then mount
it via loopback or copy it out onto a CD.  I don't believe dd is
recommended for writing CDs although in principle it should work -- if
nothing at all interrupts the write process.  I could definitely be
wrong about the latter as I've never tried it.
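
The raw-copy idea can be tried without a CD drive.  In this sketch an
ordinary file (fake_device, a made-up name) stands in for /dev/cdrom,
since reading the real device needs hardware and usually privileges;
the 2048-byte block size matches a data CD sector.

```shell
#!/bin/sh
demo=$(mktemp -d)

# Stand-in for a raw device such as /dev/cdrom: 64 blocks of 2048 bytes.
fake_dev="$demo/fake_device"
dd if=/dev/zero of="$fake_dev" bs=2048 count=64 2>/dev/null
# Overwrite the first few bytes in place so the content is not all zeros.
printf 'not a filesystem, just bytes' |
    dd of="$fake_dev" conv=notrunc 2>/dev/null

# The actual copy: dd reads raw blocks and knows nothing about files.
dd if="$fake_dev" of="$demo/disk_image.iso" bs=2048 2>/dev/null

# cmp is silent and returns 0 when the two are byte-for-byte identical.
cmp "$fake_dev" "$demo/disk_image.iso" && echo "image matches source"
```

With a real drive the copy step would be along the lines of "dd
if=/dev/cdrom of=disk_image.iso bs=2048", followed (as root) by "mount
-o loop disk_image.iso /mnt", where /mnt is just an assumed mount point.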

dump has also been ported to linux, and one can also copy filesystems
via a dump | restore pipe (I used to do this from time to time on Suns)
but because dump is far from universal on linux boxen I have fallen back
to using tar for the same purpose in pretty much the same way.

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
