How does `aws s3 sync` determine if a file has been updated?

3

30

When I run the command in the terminal back to back, it doesn't sync the second time, which is exactly what I want. But if I run my build process and then run `aws s3 sync` programmatically, back to back, it syncs all the files both times, as if my build process changes something on every run.

Can't figure out what might be happening. Any ideas?

My build process is basically `pug source/ --out static-site/` and `stylus -c styles/ --out static-site/styles/`.

Deice answered 20/4, 2017 at 21:7 Comment(2)
It might be the result of Amazon S3 being eventually consistent (see the Amazon S3 Data Consistency Model). If you put a delay between the two executions, does it behave better? Impasse
I tried running them a few minutes apart. Same result. Deice
26

According to the documentation at http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html:

S3 sync compares the size of the file and the last modified timestamp to see if a file needs to be synced.

In your case, I suspect the build system rewrites every output file, which gives each one a newer timestamp even though its size hasn't changed.
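To make that suspicion concrete, here is a minimal sketch in Python (not part of the asker's build; the file name and contents are made up for illustration). It rewrites a file with byte-identical content and shows that the size stays the same while the modification time moves forward, which is what a rebuild would look like to a size-and-timestamp comparison.

```python
import os
import tempfile
import time


def size_and_mtime(path):
    """Return (size in bytes, modification time) for a file."""
    st = os.stat(path)
    return st.st_size, st.st_mtime


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical build output; the name is only an example.
    path = os.path.join(tmp, "index.html")

    # "First build": write the file, then backdate it by an hour so the
    # difference does not depend on filesystem timestamp resolution.
    with open(path, "w") as f:
        f.write("<h1>hello</h1>\n")
    an_hour_ago = time.time() - 3600
    os.utime(path, (an_hour_ago, an_hour_ago))
    before = size_and_mtime(path)

    # "Second build": write exactly the same content again.
    with open(path, "w") as f:
        f.write("<h1>hello</h1>\n")
    after = size_and_mtime(path)

same_size = before[0] == after[0]
mtime_moved_forward = after[1] > before[1]
print("same size:", same_size)
print("mtime moved forward:", mtime_moved_forward)
```

If your build tool behaves this way, every file looks newer locally after each build, so a timestamp-based comparison will upload all of them again.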

Hance answered 21/4, 2017 at 0:18 Comment(2)
There is an `--exact-timestamps` option: when syncing from S3 to local, same-sized items are ignored only when the timestamps match exactly. The default behavior is to ignore same-sized items unless the local version is newer than the S3 version. Impasse
Hmmm... that doesn't really help. To fix this I'd need to interrupt pug's compile command to run `cmp` or something, and I can't imagine how to start doing that. I think I'll just forgo this item. Deice
25

AWS CLI sync:

A local file will require uploading if the size of the local file is different than the size of the s3 object, the last modified time of the local file is newer than the last modified time of the s3 object, or the local file does not exist under the specified bucket and prefix.

--size-only (boolean) Makes the size of each key the only criteria used to decide whether to sync from source to destination.
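The rule quoted above can be written out as a small Python sketch. This is only an illustration of the documented behavior for local-to-S3 uploads, not the AWS CLI's actual code; `FileInfo` and `needs_upload` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileInfo:
    size: int      # size in bytes
    mtime: float   # last modified time, seconds since the epoch


def needs_upload(local: FileInfo,
                 remote: Optional[FileInfo],
                 size_only: bool = False) -> bool:
    """Decide whether a local file should be uploaded.

    Mirrors the documented rule: upload if the object is missing,
    the sizes differ, or the local file is newer. With size_only=True
    the timestamp check is skipped, as with --size-only.
    """
    if remote is None:
        return True                    # not in the bucket yet
    if local.size != remote.size:
        return True                    # size changed
    if size_only:
        return False                   # same size, timestamps ignored
    return local.mtime > remote.mtime  # same size, local is newer


# A rebuilt file: same size, newer local timestamp.
rebuilt = FileInfo(size=1200, mtime=2000.0)
uploaded = FileInfo(size=1200, mtime=1000.0)

print(needs_upload(rebuilt, uploaded))                  # default rule
print(needs_upload(rebuilt, uploaded, size_only=True))  # --size-only rule
print(needs_upload(rebuilt, None, size_only=True))      # new file
```

Under these assumptions the rebuilt file is uploaded by the default rule, skipped with `size_only=True`, and a file that is missing from the bucket is uploaded either way.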

You want the `--size-only` option, which looks only at the file size and not at the last modified date. This is perfect for an asset build system that frequently changes the last modified date but not the actual contents of the files (I ran into this with webpack builds, where things like fonts kept syncing even though the file contents were identical). Watch out, though: if your build doesn't incorporate a hash of the contents into the filename, it can emit a file with the same size but different contents, and that change will not be synced.

I manually tested adding a new file that wasn't in the remote bucket, and it is indeed uploaded with `--size-only`.

Pterodactyl answered 15/12, 2018 at 21:35 Comment(4)
Hm... but what if I change the word "lump" to "pump" in an HTML file, or make some other tiny change that doesn't change the file size? Deice
@Costa No, that change won't be synced. But I would recommend using a build system that appends hashes to the filenames. At least that works great for, say, CSS and JavaScript files. In my projects I usually have only one root index.html file, so I just sync that as part of my deploy command. If you have a lot of HTML files, you'd want to work around that by syncing them differently. Pterodactyl
Gotcha. That's a fine strategy : ) I wish S3 just stored a hash of the file contents as a way to check for changes. I wonder if I could implement that on my end... o_O Deice
@Costa I agree, that would be the best way forward if S3 had that option, similar to rsync and other syncing tools. Doing it yourself is an interesting idea and seems like it would work. You'd just have to decide where to store the map of filename to hash, i.e. put it in the git repo, put it up on S3 separately, deploy from only one server and keep it local to that, or ... Pterodactyl
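The do-it-yourself idea from the comments above could look roughly like the following Python sketch. It is an assumption-laden illustration, not an AWS feature: the function names and the JSON manifest format are hypothetical, and the manifest is assumed to live outside the build directory so it isn't hashed itself.

```python
import hashlib
import json
from pathlib import Path


def hash_file(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its content hash."""
    return {
        p.relative_to(root).as_posix(): hash_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root: Path, manifest_path: Path) -> list:
    """List files whose content differs from the saved manifest.

    New files count as changed. Deleted files are not reported;
    handling those is left out of this sketch.
    """
    old = {}
    if manifest_path.exists():
        old = json.loads(manifest_path.read_text())
    new = build_manifest(root)
    return sorted(name for name, digest in new.items()
                  if old.get(name) != digest)


def save_manifest(root: Path, manifest_path: Path) -> None:
    """Record the current hashes, e.g. after a successful deploy."""
    manifest_path.write_text(
        json.dumps(build_manifest(root), indent=2, sort_keys=True)
    )
```

A deploy script could call `changed_files(...)`, upload only those paths (for example with `aws s3 cp`), and then call `save_manifest(...)`. Because the comparison is on content, a same-size edit such as "lump" to "pump" is detected, while a rebuild that only touches timestamps is not.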
12

This question is a bit dated, but I'll contribute nonetheless for folks arriving here via Google.

I agree with the accepted answer. To add some context, `aws s3 sync` behaves differently from standard Linux sync tools in a number of ways. On Linux, an MD5 hash can be computed to determine whether a file has changed. `aws s3 sync` does not do this, so it can only decide based on size and/or timestamp. What's worse, AWS does not preserve the timestamp when transferring in either direction, so the timestamp is ignored when syncing to local and only used when syncing to S3.

Psycholinguistics answered 22/1, 2019 at 21:29 Comment(0)
