Sync local folder to s3 bucket using boto3
I've noticed there is no API in boto3 for the "sync" operation that you can perform through the command line.

So,

How do I sync a local folder to a given bucket using boto3?

Mara answered 4/7, 2019 at 17:49 Comment(1)
The sync command is implemented by the AWS Command-Line Interface (CLI), which itself uses boto (or, apparently, botocore). – Shanghai
I've just implemented a simple class for this matter. I'm posting it here hoping it helps anyone with the same issue.

You could modify S3Sync.sync in order to take file size into account.

import boto3
from bisect import bisect_left
from pathlib import Path


class S3Sync:
    """
    Class that holds the operations needed to synchronize local dirs to a given bucket.
    """

    def __init__(self):
        self._s3 = boto3.client('s3')

    def sync(self, source: str, dest: str) -> None:
        """
        Sync source to dest: every element existing in source that does
        not exist in dest will be copied to dest.

        No element will be deleted.

        :param source: Source folder.
        :param dest: Destination bucket name.

        :return: None
        """

        paths = self.list_source_objects(source_folder=source)
        objects = self.list_bucket_objects(dest)

        # Get the keys and sort them so we can binary-search
        # each time we want to check if a path is already there.
        object_keys = [obj['Key'] for obj in objects]
        object_keys.sort()
        object_keys_length = len(object_keys)

        for path in paths:
            # Binary search: the path is missing if the insertion point is
            # past the end or the key found there is a different one.
            index = bisect_left(object_keys, path)
            if index == object_keys_length or object_keys[index] != path:
                # Path not found in object_keys, so it has to be synced.
                self._s3.upload_file(str(Path(source).joinpath(path)), Bucket=dest, Key=path)

    def list_bucket_objects(self, bucket: str) -> [dict]:
        """
        List all objects for the given bucket.

        :param bucket: Bucket name.
        :return: A [dict] containing the elements in the bucket.

        Example of a single object.

        {
            'Key': 'example/example.txt',
            'LastModified': datetime.datetime(2019, 7, 4, 13, 50, 34, 893000, tzinfo=tzutc()),
            'ETag': '"b11564415be7f58435013b414a59ae5c"',
            'Size': 115280,
            'StorageClass': 'STANDARD',
            'Owner': {
                'DisplayName': 'webfile',
                'ID': '75aa57f09aa0c8caeab4f8c24e99d10f8e7faeebf76c078efc7c6caea54ba06a'
            }
        }

        """
        try:
            # NOTE: list_objects returns at most 1000 keys per call.
            contents = self._s3.list_objects(Bucket=bucket)['Contents']
        except KeyError:
            # No 'Contents' key: the bucket is empty.
            return []
        else:
            return contents

    @staticmethod
    def list_source_objects(source_folder: str) -> [str]:
        """
        :param source_folder:  Root folder for resources you want to list.
        :return: A [str] containing relative names of the files.

        Example:

            /tmp
                - example
                    - file_1.txt
                    - some_folder
                        - file_2.txt

            >>> sync.list_source_objects("/tmp/example")
            ['file_1.txt', 'some_folder/file_2.txt']

        """

        path = Path(source_folder)

        paths = []

        for file_path in path.rglob("*"):
            if file_path.is_dir():
                continue
            # Store the path relative to the source folder, using forward
            # slashes so it matches S3 key conventions on any platform.
            paths.append(file_path.relative_to(path).as_posix())

        return paths


if __name__ == '__main__':
    sync = S3Sync()
    sync.sync("/temp/some_folder", "some_bucket_name")
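The relative-path listing done by list_source_objects can be sketched on its own. The directory layout below is made-up sample data created in a temporary folder just for illustration:

```python
from pathlib import Path
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "some_folder").mkdir()
    (root / "file_1.txt").write_text("hello")
    (root / "some_folder" / "file_2.txt").write_text("world")

    # Collect files relative to the root, with forward slashes so the
    # results can be used directly as S3 keys.
    rel_paths = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )
    print(rel_paths)  # ['file_1.txt', 'some_folder/file_2.txt']
```

Using Path.relative_to plus as_posix is more robust than stripping a prefix with str.replace, since it also produces forward-slash keys on Windows.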

Update:

@Z.Wei commented:

Dig into this a little to deal with the weird bisect function. We may just use if path not in object_keys:?

I think this is an interesting question that deserves an answer update rather than getting lost in the comments.

Answer:

No, if path not in object_keys would perform a linear search, which is O(n). bisect_* performs a binary search (the list has to be sorted), which is O(log(n)).

Most of the time you will be dealing with enough objects to make sorting and binary searching faster than just using the in keyword.

Take into account that you must check every path in the source against every path in the destination, making the use of in O(m * n), where m is the number of objects in the source and n the number in the destination. Using bisect, the whole thing is O((m + n) * log(n)): O(n * log(n)) to sort plus O(m * log(n)) for the lookups.
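The bisect-based membership test can be sketched in isolation; the sorted key list and lookup values below are made-up sample data:

```python
from bisect import bisect_left

def contains(sorted_keys, path):
    """Binary-search membership test: O(log n) per lookup."""
    index = bisect_left(sorted_keys, path)
    return index < len(sorted_keys) and sorted_keys[index] == path

# Sample remote keys, already sorted.
keys = sorted(["a.txt", "docs/readme.md", "img/logo.png"])

print(contains(keys, "docs/readme.md"))  # True: already in the bucket
print(contains(keys, "new_file.txt"))    # False: needs to be uploaded
```

Note the two-part condition: bisect_left only returns an insertion point, so you must also check that the key at that index actually equals the path you searched for.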

But ...

If I think about it, you could use sets to make the algorithm even faster (and simple, hence more pythonic):

def sync(self, source: str, dest: str) -> None:

    # Local paths.
    paths = set(self.list_source_objects(source_folder=source))

    # Get the keys (remote S3 paths).
    objects = self.list_bucket_objects(dest)
    object_keys = {obj['Key'] for obj in objects}

    # Compute the set difference: what we have in paths that does
    # not exist in object_keys.
    to_sync = paths - object_keys

    source_path = Path(source)
    for path in to_sync:
        self._s3.upload_file(str(source_path / path),
                             Bucket=dest, Key=path)

Searching in sets is O(1) most of the time (https://wiki.python.org/moin/TimeComplexity), so using sets the whole thing would be O(m + n), faster than the previous O((m + n) * log(n)).
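The set-difference step can be shown with a minimal sketch; the local and remote paths below are made-up sample data:

```python
# Hypothetical sample data standing in for local files and bucket keys.
local_paths = {"a.txt", "docs/readme.md", "new_file.txt"}
remote_keys = {"a.txt", "docs/readme.md", "old_file.txt"}

# Everything local that is not already in the bucket.
to_sync = local_paths - remote_keys
print(sorted(to_sync))  # ['new_file.txt']
```

Note that the difference is one-directional: remote-only keys like old_file.txt are ignored, which matches the "no element will be deleted" behavior of the class.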

Further improvements

The code could be improved even more by making the methods list_bucket_objects and list_source_objects return sets instead of lists.

Mara answered 4/7, 2019 at 17:50 Comment(9)
This was very helpful for me. Just want to point out a mistake in the "path not found in object_keys" condition. It should be something like index == object_keys_length or object_keys[index] != path. Reference: docs.python.org/3.7/library/bisect.html#searching-sorted-lists – Aude
This will upload all files with the boto3 default content-type of binary/octet-stream. See github.com/boto/boto3/issues/548#issuecomment-450580499 on how to use mimetypes to detect the mimetype and set it on the upload_file call. – Heckelphone
This looks like exactly what I need! But when I create an instance of S3Sync and run the method sync, I get the error message NameError: name 'Path' is not defined. What module is the Path class from and how can I import it? – Zoometry
@Zoometry The Path class is in the pathlib module; I am not sure, but I think it is available for Python >= 3.5. – Mara
Dug into this a little to deal with the weird bisect function. Could we just use if path not in object_keys:? – Sakhuja
@Sakhuja No, if path not in object_keys would perform a linear search, which is O(n). bisect_* performs a binary search (the list has to be sorted), which is O(log(n)). Most of the time you will be dealing with enough objects to make sorting and binary searching faster than just using the in keyword. Take into account that you must check every path in the source against every path in the destination, making the use of in O(m * n), where m is the number of objects in the source and n the number in the destination. Using bisect, the whole thing is O((m + n) * log(n)). – Mara
@Sakhuja Now that I think about it (and two years more experienced), I could have used sets... hummm – Mara
I am pretty sure the in statement uses set hashing. O(1) – Trefler
If you complete it with a function to delete the files that don't exist anymore, you have a complete and real sync. – Scribble
