I've noticed there is no API in boto3 for the "sync" operation that you can perform through the command line.
So, how do I sync a local folder to a given bucket using boto3?
I've just implemented a simple class for this matter. I'm posting it here hoping it helps anyone with the same issue.
You could modify S3Sync.sync in order to take file size into account (a sketch of that appears after the code below).
import boto3

from bisect import bisect_left
from pathlib import Path


class S3Sync:
    """
    Class that holds the operations needed to synchronize local dirs to a given bucket.
    """

    def __init__(self):
        self._s3 = boto3.client('s3')

    def sync(self, source: str, dest: str) -> None:
        """
        Sync source to dest, this means that all elements existing in
        source that do not exist in dest will be copied to dest.
        No element will be deleted.

        :param source: Source folder.
        :param dest: Destination bucket name.

        :return: None
        """
        paths = self.list_source_objects(source_folder=source)
        objects = self.list_bucket_objects(dest)

        # Getting the keys and ordering to perform binary search
        # each time we want to check if any path is already there.
        object_keys = [obj['Key'] for obj in objects]
        object_keys.sort()
        object_keys_length = len(object_keys)

        for path in paths:
            # Binary search: the path is missing if the insertion point is
            # past the end or the key at that position is not the path itself.
            index = bisect_left(object_keys, path)
            if index == object_keys_length or object_keys[index] != path:
                # If path not found in object_keys, it has to be sync-ed.
                self._s3.upload_file(str(Path(source).joinpath(path)), Bucket=dest, Key=path)

    def list_bucket_objects(self, bucket: str) -> [dict]:
        """
        List all objects for the given bucket.

        :param bucket: Bucket name.
        :return: A [dict] containing the elements in the bucket.

        Example of a single object:

        {
            'Key': 'example/example.txt',
            'LastModified': datetime.datetime(2019, 7, 4, 13, 50, 34, 893000, tzinfo=tzutc()),
            'ETag': '"b11564415be7f58435013b414a59ae5c"',
            'Size': 115280,
            'StorageClass': 'STANDARD',
            'Owner': {
                'DisplayName': 'webfile',
                'ID': '75aa57f09aa0c8caeab4f8c24e99d10f8e7faeebf76c078efc7c6caea54ba06a'
            }
        }
        """
        try:
            # Note: list_objects returns at most 1000 keys per call;
            # there is no pagination here.
            contents = self._s3.list_objects(Bucket=bucket)['Contents']
        except KeyError:
            # No 'Contents' key, empty bucket.
            return []
        else:
            return contents

    @staticmethod
    def list_source_objects(source_folder: str) -> [str]:
        """
        :param source_folder: Root folder for the resources you want to list.
        :return: A [str] containing relative names of the files.

        Example:

            /tmp
                - example
                    - file_1.txt
                    - some_folder
                        - file_2.txt

            >>> sync.list_source_objects("/tmp/example")
            ['file_1.txt', 'some_folder/file_2.txt']
        """
        path = Path(source_folder)

        paths = []

        for file_path in path.rglob("*"):
            if file_path.is_dir():
                continue
            str_file_path = str(file_path)
            str_file_path = str_file_path.replace(f'{str(path)}/', "")
            paths.append(str_file_path)

        return paths


if __name__ == '__main__':
    sync = S3Sync()
    sync.sync("/temp/some_folder", "some_bucket_name")
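As mentioned above, S3Sync.sync could be modified to also take file size into account. A hedged sketch (sync_with_size is a hypothetical method name, not part of the class above; it re-uploads a file whenever its local size differs from the stored S3 Size):

def sync_with_size(self, source: str, dest: str) -> None:
    objects = self.list_bucket_objects(dest)
    # Map each remote key to its stored size so we can compare against the local file.
    remote_sizes = {obj['Key']: obj['Size'] for obj in objects}
    source_path = Path(source)
    for path in self.list_source_objects(source_folder=source):
        local_file = source_path / path
        # Upload if the key is missing or the sizes differ.
        if remote_sizes.get(path) != local_file.stat().st_size:
            self._s3.upload_file(str(local_file), Bucket=dest, Key=path)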
@Z.Wei commented:
Dig into this a little to deal with the weird bisect function. We may just use if path not in object_keys:?
I think this is an interesting question that is worth an answer update rather than getting lost in the comments.
Answer:
No, if path not in object_keys would perform a linear search, which is O(n). bisect_* performs a binary search (the list has to be sorted), which is O(log(n)).
Most of the time you will be dealing with enough objects to make sorting and binary searching generally faster than just using the in keyword.
Take into account that you must check every path in the source against every path in the destination, making the use of in O(m * n), where m is the number of objects in the source and n the number in the destination. Using bisect, the whole thing is roughly O((m + n) * log(n)): O(n * log(n)) to sort the keys plus O(m * log(n)) for the binary searches.
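For illustration, here is the binary-search membership check in isolation (a toy, self-contained example, not part of the class above):

from bisect import bisect_left

object_keys = sorted(['a/1.txt', 'b/2.txt', 'c/3.txt'])

def contains(sorted_keys, path):
    # bisect_left returns the insertion point; the path is present only if
    # the element already at that position is the path itself.
    i = bisect_left(sorted_keys, path)
    return i != len(sorted_keys) and sorted_keys[i] == path

print(contains(object_keys, 'b/2.txt'))   # True
print(contains(object_keys, 'b/9.txt'))   # False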
Thinking about it further, you could use sets to make the algorithm even faster (and simpler, hence more pythonic):
def sync(self, source: str, dest: str) -> None:

    # Local paths.
    paths = set(self.list_source_objects(source_folder=source))

    # Getting the keys (remote S3 paths).
    objects = self.list_bucket_objects(dest)
    object_keys = {obj['Key'] for obj in objects}

    # Compute the set difference: what we have in paths that does
    # not exist in object_keys.
    to_sync = paths - object_keys

    source_path = Path(source)

    for path in to_sync:
        self._s3.upload_file(str(source_path / path),
                             Bucket=dest, Key=path)
Searching in sets is O(1) most of the time (https://wiki.python.org/moin/TimeComplexity), so, using sets, the whole thing would be O(m + n), way faster than the previous O((m + n) * log(n)).
The code could be improved even further by making the methods list_bucket_objects and list_source_objects return sets instead of lists; a sketch of that change follows.
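A minimal sketch of that change, assuming the same class and imports as above. Note that list_bucket_objects now returns only the keys (a change in contract from the version above), and list_source_objects here uses Path.relative_to instead of the string replace from the original:

def list_bucket_objects(self, bucket: str) -> set:
    # Return only the keys, as a set, instead of the full object dicts.
    try:
        contents = self._s3.list_objects(Bucket=bucket)['Contents']
    except KeyError:
        # No 'Contents' key: the bucket is empty.
        return set()
    return {obj['Key'] for obj in contents}

@staticmethod
def list_source_objects(source_folder: str) -> set:
    # Return relative paths as a set, using forward slashes so they
    # match S3 keys on any platform.
    root = Path(source_folder)
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}

def sync(self, source: str, dest: str) -> None:
    # The set difference is now a one-liner.
    to_sync = self.list_source_objects(source) - self.list_bucket_objects(dest)
    source_path = Path(source)
    for path in to_sync:
        self._s3.upload_file(str(source_path / path), Bucket=dest, Key=path)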
The condition should be index == object_keys_length or object_keys[i] != path. Reference: docs.python.org/3.7/library/bisect.html#searching-sorted-lists – Aude
NameError: name 'Path' is not defined. From what module is the Path class and how can I import it? – Zoometry
The Path class is in the module pathlib; I am not sure, but I think it is available for Python >= 3.5 – Mara
Why not just if path not in object_keys:? – Sakhuja
if path not in object_keys would perform a linear search, which is O(n). bisect_* performs a binary search (the list has to be sorted), which is O(log(n)). Most of the time you will be dealing with enough objects to make sorting and binary searching generally faster than just using the in keyword. Take into account that you must check every path in the source against every path in the destination, making the use of in O(m * n), where m is the number of objects in the source and n the number in the destination. Using bisect, the whole thing is roughly O((m + n) * log(n)). – Mara
The sync command is implemented by the AWS Command-Line Interface (CLI), which itself uses boto (or, apparently, botocore). – Shanghai
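As that comment notes, the sync behaviour lives in the AWS CLI rather than in boto3 itself. If installing the CLI on the target machine is acceptable, a simple alternative (a sketch assuming the aws executable is on PATH and credentials are configured) is to shell out to it:

import subprocess

# Mirrors the usage example above; the CLI handles size/timestamp comparison itself.
subprocess.run(
    ["aws", "s3", "sync", "/temp/some_folder", "s3://some_bucket_name"],
    check=True,
)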