Why is gsutil rsync re-downloading all our files?
We've been using gsutil -m rsync -r to keep dev and deploy boxes in sync with a GCS bucket for nearly 2 years without any problem. There are about 85k objects in the bucket.

Until recently, this worked perfectly: we'd run a deploy-box -> GCS rsync every 15 mins or so to keep all newly uploaded resources backed up, and then a GCS -> dev-box rsync whenever we wanted to refresh the local dev data (running on OSX El Capitan).

Within the last couple of months, though, the GCS->dev rsync has started to bloat, downloading more and more images.

Initially I just thought "great, we're getting more resources uploaded", but it's been growing way faster than the data itself, until today, when it seems to be downloading all 85k images.

I've double-checked that I'm in the right place and that the command and the paths are correct. The gsutil output scrolls by with reams of "Copying..." and "Downloading..." messages, making good parallel use of our 100 Mbps connection. Yet when I go to another terminal and run find . -type f | wc -l on the destination directory every 10 seconds, it shows that barely 2 or 3 new files are being added per minute. When I look at the modification times of files that gsutil says it's downloading right now, the large majority are old; plenty haven't changed in a year or more. In other words: it's re-downloading all the data, burning tons of time and bandwidth, for the sake of a few hundred genuinely new files.
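A quick way to quantify the mismatch described above is to compare the total file count against the count of recently modified files in the destination directory. The sketch below builds a throwaway directory with made-up file names to illustrate the technique; substitute your actual rsync destination for the temp dir.

```shell
#!/bin/sh
# Illustration only: count recently modified files vs. total files.
# The directory and file names here are invented for the demo.
mirror=$(mktemp -d)

# Simulate a mostly-old mirror: 5 old files, 2 freshly written ones.
for i in 1 2 3 4 5; do
  touch -t 201501010000 "$mirror/old_$i.jpg"   # mtime forced back to 2015
done
touch "$mirror/new_1.jpg" "$mirror/new_2.jpg"  # mtime = now

total=$(find "$mirror" -type f | wc -l | tr -d ' ')
recent=$(find "$mirror" -type f -mtime -1 | wc -l | tr -d ' ')
echo "total=$total recent=$recent"   # prints: total=7 recent=2

rm -rf "$mirror"
```

If "recent" is tiny compared with "total" while gsutil is busily copying everything, the transfer volume is not being driven by genuinely changed files.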

Has something changed in recent OSX gsutil versions? Is there possibly a bug? How would I even start to go about tracking this down? Or reporting it? The newsgroups gsutil-discuss and gs-discussion have been archived, and the talk in gce-discussion is all about using gsutil from GCE instances.

Thanks!

Bounds answered 18/8, 2016 at 11:7 Comment(3)
In gsutil 4.20 (released 2016-07-20), the change detection algorithm for sync'ing changed from using only file size to comparing both the size and file modification time of local files. Are the file modification times on the dev boxes different from those on the deploy boxes? If so, that could explain this issue.Poitiers
Hey, thanks for your help Travis! I think that's almost certainly the answer; we created a new deploy box instance 227 days ago and rsync'd all the files onto it, and it seems from find . -type f -mtime +227 that the initial sync wrote all the modification times as the moment when they were rsync'd rather than their original timestamps from GCS. Is there anything we can do about this, apart from remove the whole lot from dev and re-rsync them? Is this what gsutil should do, anyway? (Also, if you want to put this in an answer then I can accept it and ask my follow-ups as comments there :-))Bounds
The problem is that files with unchanged contents are getting sync'ed down to dev boxes, yes? Are you changing the mtime of files on the deploy box in any way when you rsync from deploy -> GCS? I would expect that a sync from GCS -> dev would copy the files once if their mtimes differed, but subsequent syncs would be incremental. However, if you created a new deploy box with different file mtimes and then uploaded those, that would appear as all of the files had been modified, and would cause another "full sync" down to dev boxes.Poitiers

gsutil 4.20 (released 2016-07-20) modified the change detection algorithm for rsync. Instead of comparing only the size of the local file with its cloud counterpart, it now compares both the size and the file modification time of local files. The modification time is stored in the custom user metadata for the file when it is uploaded with rsync; if that metadata doesn't exist, the object creation time is used instead.
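The behavioral difference between the old and new rules can be sketched without gsutil at all. The script below is my paraphrase of the logic described above, not gsutil's actual code: the pre-4.20 rule compares sizes only, while the 4.20+ rule also compares mtimes, so a file re-created with identical bytes but a fresh mtime (e.g. on a rebuilt deploy box) now looks "changed".

```shell
#!/bin/sh
# Paraphrase of the two change-detection rules; not gsutil internals.
src=$(mktemp -d); dst=$(mktemp -d)

printf 'same bytes' > "$src/a.txt"
printf 'same bytes' > "$dst/a.txt"
touch -t 201601010000 "$src/a.txt"   # original mtime
touch -t 201608180000 "$dst/a.txt"   # rebuilt copy got a fresh mtime

size_src=$(wc -c < "$src/a.txt" | tr -d ' ')
size_dst=$(wc -c < "$dst/a.txt" | tr -d ' ')

# Pre-4.20 rule: sizes match -> unchanged, skip the copy.
[ "$size_src" = "$size_dst" ] && old_rule=skip || old_rule=copy

# 4.20+ rule: sizes match but mtimes differ -> treated as changed.
if [ "$size_src" = "$size_dst" ] \
   && [ ! "$src/a.txt" -nt "$dst/a.txt" ] \
   && [ ! "$dst/a.txt" -nt "$src/a.txt" ]; then
  new_rule=skip
else
  new_rule=copy
fi
echo "old_rule=$old_rule new_rule=$new_rule"   # old_rule=skip new_rule=copy

rm -rf "$src" "$dst"
```

This is exactly the asker's situation: the new deploy box rewrote every mtime, so under the new rule every object appears modified even though no bytes changed.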

Poitiers answered 19/8, 2016 at 17:14 Comment(1)
I want to call out a correction (from the documentation): for cloud to local rsync, if file mtime metadata doesn't exist, the object creation time is used instead of checksums.Poitiers

I had a similar issue where the same files were synced over and over. I don't have that many files, so you might need to check performance yourself, but I decided to use the -c option to force checksum comparison instead of mtime, which was being modified locally by my build process. I think (and hope) the documentation is slightly wrong in stating that

compare checksums for files if the size of source and destination as well as mtime match

since it seems to use the checksum even when the mtimes do not match
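What -c buys you can be illustrated the same way (again my own sketch, not gsutil internals): with content hashing, two copies with identical bytes compare equal even though their mtimes differ, so no re-transfer is needed. The demo uses POSIX cksum; gsutil itself uses MD5/CRC32C, but any content hash makes the same point.

```shell
#!/bin/sh
# Illustration of checksum-based comparison; not gsutil internals.
work=$(mktemp -d)
printf 'identical contents' > "$work/local.bin"
printf 'identical contents' > "$work/remote.bin"
touch -t 201601010000 "$work/local.bin"   # build process rewrote the mtime

sum_local=$(cksum < "$work/local.bin")
sum_remote=$(cksum < "$work/remote.bin")

# Identical bytes -> identical checksum -> skip, despite the mtime mismatch.
[ "$sum_local" = "$sum_remote" ] && verdict=skip || verdict=copy
echo "verdict=$verdict"   # prints: verdict=skip

rm -rf "$work"
```

The trade-off is that checksumming has to read every local file, so on very large trees it costs local I/O and CPU in exchange for saved network transfer.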

Polyphone answered 12/10, 2016 at 10:12 Comment(2)
I had local files whose mtimes were updated constantly even though their contents weren't changing. The -c option worked for me... to get back to the original sync behavior. Much faster. Thank you!Dichogamy
Thanks, -c helps a lot. Also be sure to install compiled crcmod: cloud.google.com/storage/docs/gsutil/addlhelp/…Vestal
