How can I use boto to stream a file out of Amazon S3 to Rackspace Cloudfiles?

I'm copying a file from S3 to Cloudfiles, and I would like to avoid writing the file to disk. The Python-Cloudfiles library has an object.stream() call that looks to be what I need, but I can't find an equivalent call in boto. I'm hoping that I would be able to do something like:

shutil.copyfileobj(s3Object.stream(),rsObject.stream())

Is this possible with boto (or I suppose any other s3 library)?

Caulicle answered 2/10, 2011 at 6:9 Comment(1)
The smart_open Python library does that (both for reading and writing). – Gentry

The Key object in boto, which represents an object in S3, can be used like an iterator, so you should be able to do something like this:

>>> import boto
>>> c = boto.connect_s3()
>>> bucket = c.lookup('garnaat_pub')
>>> key = bucket.lookup('Scan1.jpg')
>>> for chunk in key:
...     output_fp.write(chunk)  # write each chunk to your output stream

Or, as in the case of your example, you could do:

>>> shutil.copyfileobj(key, rsObject.stream())
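
More generally, a boto Key also exposes a file-like read() method, so it can be handed to anything that expects a readable file object. A minimal sketch, assuming boto 2.x (the bucket and key names below are placeholders):

import shutil
import boto

conn = boto.connect_s3()
bucket = conn.get_bucket('my-bucket')          # placeholder bucket name
key = bucket.get_key('path/to/object.dat')     # placeholder key name

with open('/tmp/output.dat', 'wb') as dest:
    shutil.copyfileobj(key, dest, length=16 * 1024)  # stream in 16 KiB chunks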
Decay answered 2/10, 2011 at 7:54 Comment(2)
S3.Object is not iterable anymore. – Quandary
S3.Object is still iterable, but via the body: s3_object.get()['Body'].iter_lines(), so something along those lines. – Gratulate

Other answers in this thread relate to the original boto library, but S3.Object is not iterable in boto3. So, the following DOES NOT WORK; it produces a TypeError: 's3.Object' object is not iterable error message:

import io
import boto3

s3 = boto3.session.Session(profile_name=my_profile).resource('s3')
s3_obj = s3.Object(bucket_name=my_bucket, key=my_key)

with io.FileIO('sample.txt', 'w') as file:
    for i in s3_obj:
        file.write(i)

In boto3, the contents of the object are available at S3.Object.get()['Body'], which has been an iterable since version 1.9.68 but wasn't previously. Thus the following works with recent versions of boto3 but not with earlier ones:

body = s3_obj.get()['Body']
with io.FileIO('sample.txt', 'w') as file:
    for i in body:
        file.write(i)

So, an alternative for older boto3 versions is to use the read method, but this loads the WHOLE S3 object into memory, which is not always an option when dealing with large files:

body = s3_obj.get()['Body']
with io.FileIO('sample.txt', 'w') as file:
    file.write(body.read())

But the read method allows you to pass in the amt parameter, specifying the number of bytes we want to read from the underlying stream. This method can be called repeatedly until the whole stream has been read:

body = s3_obj.get()['Body']
with io.FileIO('sample.txt', 'w') as file:
    while file.write(body.read(amt=512)):
        pass

Digging into the botocore.response.StreamingBody code, one realizes that the underlying stream is also available, so we could iterate as follows:

body = s3_obj.get()['Body']
with io.FileIO('sample.txt', 'w') as file:
    for b in body._raw_stream:
        file.write(b)

While googling I've also seen some links that could be useful, but I haven't tried them.

Fortier answered 17/11, 2016 at 17:32 Comment(3)
Very useful answer. Thanks @smallo. I appreciate that you exposed the private _raw_stream, which is what I think most people are looking for. – See
If I pass around this body StreamingBody, does this mean the HTTP connection isn't terminated? Or is the streaming body buffered up? – Overlie
Not sure if this was available when this answer was written, but botocore.response.StreamingBody now exposes iter_chunks and iter_lines for this purpose. – Metatherian
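
As a quick illustration of that last comment, a minimal boto3 sketch using iter_chunks (the bucket and key names are placeholders):

import boto3

body = boto3.resource('s3').Object('my-bucket', 'my-key').get()['Body']
with open('sample.txt', 'wb') as f:
    for chunk in body.iter_chunks(chunk_size=1024 * 1024):  # 1 MiB per chunk
        f.write(chunk)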

I figure at least some of the people seeing this question will be like me and will want a way to stream a file from boto line by line (or comma by comma, or by any other delimiter). Here's a simple way to do that:

from boto.s3.connection import S3Connection

def getS3ResultsAsIterator(self, aws_access_info, key, prefix):
    s3_conn = S3Connection(**aws_access_info)
    bucket_obj = s3_conn.get_bucket(key)  # here `key` is the bucket name
    # go through the list of files under the given prefix
    for f in bucket_obj.list(prefix=prefix):
        unfinished_line = ''
        for byte in f:
            byte = unfinished_line + byte
            # split on whatever, or use a regex with re.split()
            lines = byte.split('\n')
            unfinished_line = lines.pop()
            for line in lines:
                yield line

@garnaat's answer above is still great and 100% true. Hopefully mine still helps someone out.

Scleroma answered 3/6, 2013 at 4:29 Comment(6)
Split on both types of line endings with lines = re.split(r'[\n\r]+', byte) - helpful for CSV files exported from Excel. – Comfy
One more note: I had to add yield unfinished_line after the for byte in f: loop was complete, otherwise the last line would not get processed. – Comfy
Is there a good reason why this is not part of the boto3 API? If not, should one submit a pull request to fix this? I'd be super down for knocking something like it up! – Arjun
@Scleroma Yes, I'll knock up a generator-based thing which will chunk the stream into lines by a given delimiter. Super keen to smash that out! – Arjun
Makes sense to me, especially the delimiter part, something like getS3ResultsAsIterator(self, aws_access_info, key, prefix, delimiter="\n") (a sketch along these lines follows below). – Scleroma
Let's see how this pull request goes over at botocore: github.com/boto/botocore/pull/1034 – Arjun
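
Following up on the delimiter discussion in these comments, here is a minimal boto3-based sketch (the stream_records name and the bucket/key values in the usage lines are hypothetical) that reads the stream in chunks and splits records on an arbitrary delimiter:

import boto3

def stream_records(bucket, key, delimiter=b'\n', chunk_size=1024 * 1024):
    """Yield delimiter-separated records without loading the whole object."""
    body = boto3.client('s3').get_object(Bucket=bucket, Key=key)['Body']
    unfinished = b''
    for chunk in body.iter_chunks(chunk_size=chunk_size):
        unfinished += chunk
        records = unfinished.split(delimiter)
        unfinished = records.pop()   # the last piece may be incomplete
        for record in records:
            yield record
    if unfinished:
        yield unfinished             # emit the trailing record, if any

# usage (hypothetical bucket/key):
# for line in stream_records('my-bucket', 'logs/app.log'):
#     print(line)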

Botocore's StreamingBody has an iter_lines() method:

https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html#botocore.response.StreamingBody.iter_lines

So:

import boto3
s3r = boto3.resource('s3')
iterator = s3r.Object(bucket, key).get()['Body'].iter_lines()

for line in iterator:
    print(line)
Showers answered 31/8, 2018 at 19:28 Comment(2)
This doesn't continue the stream, it just gets one chunk. – Chlorobenzene
@Chlorobenzene You can specify the chunk size as you need: .iter_lines(chunk_size=1024) – Trave
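
One caveat worth noting: by default iter_lines() yields lines without their trailing newline characters, so if you are writing them back out you need to re-add the delimiter yourself. A minimal sketch (bucket and key are placeholders):

import boto3

body = boto3.resource('s3').Object(bucket, key).get()['Body']
with open('sample.txt', 'wb') as f:
    for line in body.iter_lines(chunk_size=1024 * 1024):
        f.write(line + b'\n')   # iter_lines strips the line endings by default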

This is my solution for wrapping the streaming body:

import io
import boto3

class S3ObjectIterator(io.RawIOBase):
    def __init__(self, bucket, key):
        """Initialize with S3 bucket and key names."""
        self.s3c = boto3.client('s3')
        self.obj_stream = self.s3c.get_object(Bucket=bucket, Key=key)['Body']

    def read(self, n=-1):
        """Read up to n bytes from the underlying StreamingBody (all of it if n == -1)."""
        return self.obj_stream.read() if n == -1 else self.obj_stream.read(n)

    def readable(self):
        """Report the stream as readable so it can be wrapped, e.g. in io.BufferedReader."""
        return True

Example usage:

obj_stream = S3ObjectIterator(bucket, key)
for line in obj_stream:
    print(line)
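
Because the class subclasses io.RawIOBase (and, with the readable() override above, reports itself as readable), it can also be wrapped in io.BufferedReader for buffered reads instead of the byte-at-a-time readline that raw iteration falls back to. A small sketch, again with placeholder bucket/key names:

import io

raw = S3ObjectIterator(bucket, key)
buffered = io.BufferedReader(raw, buffer_size=1024 * 1024)  # 1 MiB read buffer
for line in buffered:
    print(line)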
Bren answered 28/11, 2016 at 22:26 Comment(0)

If you are open to other options, smart_open is a utility for streaming large files in Python, and it makes this kind of work extremely easy.

Here are two examples:

import boto3
from smart_open import open

session = boto3.Session(
    aws_access_key_id="xxx",
    aws_secret_access_key="xxx",
)
client = session.client('s3')

for line in open(
    "s3://my-bucket/my-file.txt",
    transport_params=dict(client=client),
):
    print(line)

For a compressed file:

import boto3
from smart_open import open

session = boto3.Session(
    aws_access_key_id="xxx",
    aws_secret_access_key="xxx",
)
client = session.client('s3')

for line in open(
    "s3://my-bucket/my-file.txt.gz",
    encoding="utf-8",
    transport_params=dict(client=client),
):
    print(line)
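
smart_open can also write to S3 in the same streaming way, so a disk-free copy between two objects looks roughly like this (bucket and key names are placeholders; the transport_params client setup from above applies here too):

import shutil
from smart_open import open

with open("s3://my-bucket/my-file.txt", "rb") as fin, \
        open("s3://my-other-bucket/my-copy.txt", "wb") as fout:
    shutil.copyfileobj(fin, fout)   # streamed in chunks, nothing is written to disk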
Dravidian answered 27/3, 2023 at 19:54 Comment(0)
