On Thu, Jan 22, 2009 at 1:43 PM, Mike Miller <mbmiller at taxa.epi.umn.edu> wrote:

>
> If every file has to be moved, the comparing would be wasted time, but if
> files are large and most do not have to be moved, the comparison may
> massively save time, especially if the network is slow.
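>
> (As an aside, that compare-before-copying approach is what a tool like
> rsync handles for you.  Assuming rsync is installed on both machines,
> and with /some/destination standing in for the real target path, a
> sketch would be:
>
> rsync -av directory/ user at target.machine:/some/destination/
>
> By default rsync skips files whose size and timestamp already match on
> the target; adding --checksum makes it compare file contents instead,
> which is slower but catches files that differ despite having matching
> timestamps.)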
>
>
> It happens that I started to write the info below a couple of months ago
> to share with this list and did not finish it, but I'm finishing it now.
> My problem was to copy many files from one machine to another, but none of
> the files existed on the target machine.  I really just wanted to make a
> gzipped tar file (.tgz) and send it to another machine.  I didn't have
> much free disk space on the source machine so I had to do a little work to
> figure out the tricks.  Read on:
>
>
> I want to move files from one GNU/Linux box to another.  The disks are
> nearly full on the box that currently holds the files, so I can't write
> a .tgz on the source machine and then send the .tgz file.  The data are
> about 13GB uncompressed and about 3.7GB in .tgz format.  This is how I
> got the latter number:
>
> tar zpcf - directory | wc -c
>
> That sends the zipped tar to stdout where the bytes are counted by wc.  I
> have about 210,000 files and directories.
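>
> (If you want such a count yourself, something like "find directory |
> wc -l" lists every file and directory under "directory" and counts the
> lines.)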
>
> There are some good suggestions here on how to proceed:
>
> http://happygiraffe.net/copy-net
>
> I wanted to have the .tgz file on the other side instead of having tar
> unpack it automatically, so I found out I could do this on the old
> machine to send files to the new machine...
>
> tar zpcf - directory | ssh user at target.machine "cat > backup.tgz"
>
> ...and it packs "directory" from the old machine into the backup.tgz file
> on the new machine.  Nice.
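>
> (For comparison, the unpack-as-it-arrives variant would look something
> like this, with /some/destination standing in for wherever the files
> should land on the new machine:
>
> tar zpcf - directory | ssh user at target.machine "tar zpxf - -C /some/destination"
>
> That extracts the files on the target as the stream arrives instead of
> leaving a .tgz file behind.)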
>
> One small problem:  I didn't have a way to be sure that there were no
> errors in file transmission.  First, some things that did not work:
>
> tar zpcf - directory | md5sum
>
> Testing that on a small directory gave me, to my surprise, different
> results every time.  What was changing?  I didn't get it.  I could tell
> that it was probably caused by gzip because...
>
> $ echo "x" | gzip - > test1.gz
>
> $ echo "x" | gzip - > test2.gz
>
> $ md5sum test?.gz
> 358cc3d6fe5d929cacd00ae4c2912bf2  test1.gz
> 601a8e99e56741d5d8bf42250efa7d26  test2.gz
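>
> (It turns out the difference comes from the timestamp gzip stores in
> its header.  Assuming GNU gzip, the -n flag tells it not to store the
> name and timestamp, so repeated runs should produce byte-for-byte
> identical output:
>
> $ echo "x" | gzip -n - > test1.gz
> $ echo "x" | gzip -n - > test2.gz
> $ md5sum test?.gz
>
> Both checksums should now match.)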
>
> So gzip wasn't using a random seed -- it was the timestamp embedded in
> the gzip header that was changing between runs.  Then I realized that,
> rather than fight that, I just had to use this method of checking
> md5sums...
>
> On the source machine:
> tar pcf - directory | md5sum
>
> Then do this to transfer the data:
> tar zpcf - directory | ssh user at target.machine "cat > backup.tgz"
>
> After transferring, do this on the target machine:
> gunzip -c backup.tgz | md5sum
>
> The two md5sums are created without making new files on either side and
> they will match if there are no errors.  I moved about 30GB of compressed
> data this way in three large .tgz files and found no errors -- the md5sums
> always matched.
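>
> (One possible refinement, assuming bash on the source machine: tee with
> process substitution can compute the checksum during the transfer, so
> the directory only has to be read once...
>
> tar pcf - directory | tee >(md5sum >&2) | gzip -c | ssh user at target.machine "cat > backup.tgz"
>
> ...the md5sum of the uncompressed tar stream is printed to stderr on
> the source machine, and it should match the output of "gunzip -c
> backup.tgz | md5sum" on the target.)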



To me, the file comparison isn't that big of a deal, and I'd only be
concerned about the time it took if it were a cron job scheduled to run in
a tight window (say, every 10 minutes for a 3GB filesystem).  If it's to
populate a new system, it wouldn't bother me.  I would say that if it's
that much of a concern on the initial load, then you haven't given yourself
enough time to do the work.  Remember the 6 P's...

While I admire the thought you put into your process above, IMO it's not
efficient enough for my tastes, and it leaves too many chances for errors.
Here's how I would have done it: