RSync single (archive) file that changes every time

I am working on an open source backup utility that backs up files and transfers them to various external locations such as Amazon S3, Rackspace Cloud Files, Dropbox, and remote servers through FTP/SFTP/SCP protocols.

Now, I have received a feature request for incremental backups (for cases where the backups are large and become expensive to transfer and store). I have been looking around and someone mentioned the rsync utility. I ran some tests with it but am unsure whether it is suitable, so I would like to hear from anyone who has experience with rsync.

Let me give you a quick rundown of what happens when a backup is made. Basically it'll start by dumping databases such as MySQL, PostgreSQL, MongoDB, and Redis. It might also take a few regular files (like images) from the file system. Once everything is in place, it'll bundle it all into a single .tar archive (additionally it'll compress and encrypt it using gzip and openssl).
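In shell terms, the bundling step is roughly equivalent to something like this (directory names, cipher, and key file are just illustrative placeholders, not the actual implementation):

tar -cf mybackup.tar dumps/ images/
gzip mybackup.tar
openssl enc -aes-256-cbc -in mybackup.tar.gz -out mybackup.tar.gz.enc -pass file:backup.key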

Once that's all done, we have a single file that looks like this:
mybackup.tar.gz.enc

Now I want to transfer this file to a remote location. The goal is to reduce bandwidth and storage cost. Let's assume this little backup package is about 1GB in size. We use rsync to transfer it to a remote location and remove the local copy. The next day a new backup file is generated, and it turns out that a lot more data has been added in the past 24 hours, so the new mybackup.tar.gz.enc comes out at about 1.2GB.

Now, my question is: Is it possible to transfer just the 200MB that got added in the past 24 hours? I tried the following command:

rsync -vhP --append mybackup.tar.gz.enc backups/mybackup.tar.gz.enc

The result:

mybackup.tar.gz.enc
      1.20G 100%   36.69MB/s    0:00:46 (xfer#1, to-check=0/1)

sent 200.01M bytes  received 849.40K bytes  8.14M bytes/sec
total size is 1.20G  speedup is 2.01

Looking at the sent 200.01M bytes I'd say the "appending" of the data worked properly. What I'm wondering now is whether it transferred the whole 1.2GB in order to figure out how much and what to append to the existing backup, or did it really only transfer the 200MB? Because if it transferred the whole 1.2GB then I don't see how it's much different from using the scp utility on single large files.

Also, if what I'm trying to accomplish is at all possible, what flags do you recommend? If it's not possible with rsync, is there any utility you can recommend to use instead?

Any feedback is much appreciated!

Bechtel answered 4/3, 2011 at 23:29 Comment(0)

It sent only what it says it sent - only transferring the changed parts is one of the major features of rsync. It uses some rather clever checksumming algorithms (and it sends those checksums over the network, but this is negligible - several orders of magnitude less data than transferring the file itself; in your case, I'd assume that's the .01 in 200.01M) and only transfers those parts it needs.
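If you want to see this for yourself, ask rsync for detailed transfer statistics (the remote host and path here are just placeholders for wherever your backups go):

rsync -vhP --stats mybackup.tar.gz.enc user@remote:backups/

Compare "Total bytes sent" with "Total file size" in the output; when the delta algorithm kicks in, the former is far smaller than the latter.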

Note also that there already are quite powerful backup tools based on rsync - namely, Duplicity. Depending on the license of your code, it may be worthwhile to see how they do this.
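For reference, a minimal Duplicity run looks something like this (source directory and SFTP destination are hypothetical placeholders):

duplicity /path/to/backup-source sftp://user@remote/backups

Duplicity uses librsync to build full and incremental volumes, so subsequent runs only upload the changed blocks.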

Basidium answered 4/3, 2011 at 23:57 Comment(5)
Thanks for the reply. Yeah, I was a bit unsure because the backup I generate every time is a completely new file. All the databases are dumped again, the images are gathered again, and everything is combined into a single new mybackup.tar.gz.enc. Since this is basically a whole new file, I had my doubts that rsync might not cope with it, or that it might break the algorithm somehow. But yeah, you have a point. Thanks for your feedback!Bechtel
@Michael van Rooijen: It doesn't matter if it's new or not, what matters are differences between the file you have locally and the remote one. Since the process of database dumping is deterministic, the various dumps of the same database will have much in common.Basidium
Right. When I package everything into a .tar file it does indeed only send a few KB for a file that's actually 3.5MB. However, once I compress the file with gzip it starts sending about 2MB again. So while the amount transferred is still slightly reduced, it seems rsync has a hard time dealing with compressed backups. I'm assuming this is the same with encryption. So I will probably have to keep it as a .tar and rsync that. Thanks for your help!Bechtel
@Michael van Rooijen: rsync has built-in compression (with -z switch), so manually de/compressing is not necessary. (Also, look at the --fuzzy option, could be useful in your situation). manpagez.com/man/1/rsyncBasidium
Also, if anyone is still reading this, gzip has the --rsyncable option exactly for this.Basidium

The nature of gzip is such that small changes in the source file can result in very large changes to the resultant compressed file - gzip will make its own decisions each time about the best way to compress the data that you give it.

Some versions of gzip have the --rsyncable switch, which makes gzip periodically reset its compression state as it works through the input. This results in slightly less efficient compression (in most cases) but confines changes in the output file to the same region as the changes in the source file, which is exactly what rsync's delta algorithm needs.
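If your gzip build supports it, the switch is simply applied at compression time (using the file name from the question):

gzip --rsyncable mybackup.tar

The resulting mybackup.tar.gz is usually only marginally larger than a normal gzip, but successive versions of it rsync much more efficiently.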

If that's not available to you, then it's typically best to rsync the uncompressed file (using rsync's own compression if bandwidth is a consideration) and compress at the end (if disk space is a consideration). Obviously this depends on the specifics of your use case.
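As a sketch of that approach (remote host and path are placeholders): keep the archive uncompressed on disk and let rsync compress it on the wire only:

rsync -avhPz mybackup.tar user@remote:backups/

The -z option compresses the data stream during transfer; the stored file stays an uncompressed .tar on both ends, so the delta algorithm continues to work well on the next run.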

Konya answered 24/10, 2012 at 14:43 Comment(1)
FWIW, in rsync -z will compress file data during the transfer. Perhaps in some cases that may be an alternative to gzipping up front...Egyptian

Beware: since rsync 3.0.0, --append WILL BREAK your file contents if there are any changes in the data that already exists on the receiving side, because that existing data is no longer verified before appending.
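For reference, rsync 3.0.0 also added --append-verify, which behaves like --append did before 3.0.0: the data already on the receiver is included in the whole-file checksum, and the file is resent normally if it doesn't match (remote host and path here are placeholders):

rsync -vhP --append-verify mybackup.tar.gz.enc user@remote:backups/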

Phallus answered 22/10, 2013 at 9:58 Comment(1)
Do you have a link to elaborate on this? Are you referring to the fact that it causes rsync to update a file by appending data onto the end of the file, which presumes that the data that already exists on the receiving side is identical to the start of the file on the sending side?Egyptian
