Using Python Requests to 'bridge' a file without loading into memory?

I'd like to use the Python Requests library to GET a file from a URL and use it as a multipart-encoded file in a POST request. The catch is that the file could be very large (50 MB-2 GB) and I don't want to load it into memory. (Context here.)

Following examples in the docs (multipart, stream down, and stream up), I cooked up something like this:

    with requests.get(big_file_url, stream=True) as f:
        requests.post(upload_url, files={'file': ('filename', f.content)})

but I'm not sure I'm doing it right. It is in fact throwing this error (redacted from the traceback):

    with requests.get(big_file_url, stream=True) as f:
    AttributeError: __exit__

Any suggestions?

Exposed answered 12/4, 2013 at 13:55 Comment(2)
Replace the with ... as f: statement with f = ..., and the f.content with f, because with needs __enter__ and __exit__, and the docs tell me you can pass the file directly. – Ehtelehud
@Ehtelehud thanks for the tip, but I get this error: AttributeError: 'Response' object has no attribute 'read' - I assume this means that my code expects the file and not the response. – Exposed

There is actually an issue about that on Kenneth Reitz's GitHub repo. I had the same problem (although I'm just uploading a local file), and I added a wrapper class that holds a list of streams corresponding to the different parts of the request, with a read() method that iterates through the list and reads each part, and that also computes the values needed for the headers (boundary and content-length):

# coding=utf-8

from __future__ import unicode_literals
from mimetools import choose_boundary
from requests.packages.urllib3.filepost import iter_fields, get_content_type
from io import BytesIO
import codecs

writer = codecs.lookup('utf-8')[3]

class MultipartUploadWrapper(object):

    def __init__(self, files):
        """
        Initializer

        :param files:
            A dictionary of files to upload, of the form {'file': ('filename', <file object>)}
        :type network_down_callback:
            Dict
        """
        super(MultipartUploadWrapper, self).__init__()
        self._cursor = 0
        self._body_parts = None
        self.content_type_header = None
        self.content_length_header = None
        self.create_request_parts(files)

    def create_request_parts(self, files):
        # Build a list of streams (one per multipart section) and compute the
        # total content length along the way.
        request_list = []
        boundary = choose_boundary()
        content_length = 0

        boundary_string = b'--%s\r\n' % (boundary)
        for fieldname, value in iter_fields(files):
            content_length += len(boundary_string)

            if isinstance(value, tuple):
                filename, data = value
                content_disposition_string = (('Content-Disposition: form-data; name="%s"; ''filename="%s"\r\n' % (fieldname, filename))
                                            + ('Content-Type: %s\r\n\r\n' % (get_content_type(filename))))

            else:
                data = value
                content_disposition_string =  (('Content-Disposition: form-data; name="%s"\r\n' % (fieldname))
                                            + 'Content-Type: text/plain\r\n\r\n')
            request_list.append(BytesIO(str(boundary_string + content_disposition_string)))
            content_length += len(content_disposition_string)
            if hasattr(data, 'read'):
                data_stream = data
            else:
                data_stream = BytesIO(str(data))

            data_stream.seek(0,2)
            data_size = data_stream.tell()
            data_stream.seek(0)

            request_list.append(data_stream)
            content_length += data_size

            end_string = b'\r\n'
            request_list.append(BytesIO(end_string))
            content_length += len(end_string)

        closing_boundary = b'--%s--\r\n' % (boundary)
        request_list.append(BytesIO(closing_boundary))
        # The closing boundary is two bytes longer than the opening one ("--" suffix).
        content_length += len(closing_boundary)

        # There's a bug in httplib.py that generates a UnicodeDecodeError on binary uploads if
        # there are *any* unicode strings passed into headers as part of the requests call.
        # For this reason all strings are explicitly converted to non-unicode at this point.
        self.content_type_header = {b'Content-Type': b'multipart/form-data; boundary=%s' % boundary}
        self.content_length_header = {b'Content-Length': str(content_length)}
        self._body_parts = request_list

    def read(self, chunk_size=0):
        # Read up to chunk_size bytes, crossing from one body part to the next as needed.
        remaining_to_read = chunk_size
        output_array = []
        while remaining_to_read > 0:
            body_part = self._body_parts[self._cursor]
            current_piece = body_part.read(remaining_to_read)
            length_read = len(current_piece)
            output_array.append(current_piece)
            if length_read < remaining_to_read:
                # we finished this piece but haven't read enough, moving on to the next one
                remaining_to_read -= length_read
                if self._cursor == len(self._body_parts) - 1:
                    break
                else:
                    self._cursor += 1
            else:
                break
        return b''.join(output_array)

So instead of passing a 'files' keyword argument, you pass this object as the 'data' argument of your request, along with the headers it computed.
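
For example, here is roughly how it would be wired up (a sketch only, untested; the file name and URL are placeholders, and the computed headers have to be passed explicitly):

wrapper = MultipartUploadWrapper({'file': ('report.xls', open('report.xls', 'rb'))})
headers = {}
headers.update(wrapper.content_type_header)
headers.update(wrapper.content_length_header)
response = requests.post('http://example.com/upload', data=wrapper, headers=headers)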

Edit

I've cleaned up the code.

Nausea answered 25/4, 2013 at 17:26 Comment(3)
Note: the response object returned by requests.get() has no .seek() method; the length could be calculated using GET headers in this case. – Pazia
Thanks for this - I ended up using a completely different approach, but this would've probably done the trick! – Exposed
Thanks! I'm curious to know more about your approach, though. – Nausea

As other answers have already pointed out: requests doesn't support POSTing multipart-encoded files without loading them into memory.

To upload a large file without loading it into memory using multipart/form-data, you could use poster:

#!/usr/bin/env python
import sys
from urllib2 import Request, urlopen

from poster.encode import multipart_encode # $ pip install poster
from poster.streaminghttp import register_openers

register_openers() # install openers globally

def report_progress(param, current, total):
    sys.stderr.write("\r%03d%% of %d" % (int(1e2*current/total + .5), total))

url = 'http://example.com/path/'
params = {'file': open(sys.argv[1], "rb"), 'name': 'upload test'}
response = urlopen(Request(url, *multipart_encode(params, cb=report_progress)))
print response.read()

It can be adapted to accept a GET response object instead of a local file:

import posixpath
import sys
from urllib import unquote
from urllib2 import Request, urlopen
from urlparse import urlsplit

from poster.encode import MultipartParam, multipart_encode # pip install poster
from poster.streaminghttp import register_openers

register_openers() # install openers globally

class MultipartParamNoReset(MultipartParam):
    def reset(self):
        pass # do nothing (to allow self.fileobj without seek() method)

get_url = 'http://example.com/bigfile'
post_url = 'http://example.com/path/'

get_response = urlopen(get_url)
param = MultipartParamNoReset(
    name='file',
    filename=posixpath.basename(unquote(urlsplit(get_url).path)),  # XXX: backslashes in the path are not handled
    filetype=get_response.headers['Content-Type'],
    filesize=int(get_response.headers['Content-Length']),
    fileobj=get_response)

params = [('name', 'upload test'), param]
datagen, headers = multipart_encode(params, cb=report_progress)  # report_progress() from the first example
post_response = urlopen(Request(post_url, datagen, headers))
print post_response.read()

This solution requires a valid Content-Length header (a known file size) in the GET response. If the file size is unknown, then chunked transfer encoding could be used to upload the multipart/form-data content. A similar solution could be implemented using urllib3.filepost, which is shipped with the requests library, e.g., based on @AdrienF's answer, without using poster.

Pazia answered 27/4, 2013 at 0:41 Comment(1)
I'd read about poster in my research, but didn't look deep enough - thought it had the same limitations as requests - my bad. This is a good solution, but I accepted @AdrienF's as I was actually planning on using requests exclusively. Thanks anyway :) – Exposed

You cannot turn just anything into a context manager in Python; it requires very specific attributes (__enter__ and __exit__). With your current code you can do the following:

response = requests.get(big_file_url, stream=True)

post_response = requests.post(upload_url, files={'file': ('filename', response.iter_content())})

Using iter_content will ensure that your file is never in memory: the iterator is consumed instead. If you used the content attribute, the whole file would be loaded into memory.

Edit

The only way to reasonably do this is to use chunk-encoded uploads, e.g.,

post_response = requests.post(upload_url, data=response.iter_content())

If you absolutely need to do multipart/form-data encoding, then you will have to create an abstraction layer that takes the generator in its constructor, uses the Content-Length header from the response (to provide an answer for len(file)), and has a read method that reads from the generator. The issue again is that I'm pretty sure the entire thing will be read into memory before it is uploaded.
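
A rough sketch of what such an adapter could look like (the class name is made up and this is untested):

class GeneratorFileAdapter(object):
    def __init__(self, generator, content_length):
        self._iterator = iter(generator)
        self._length = int(content_length)  # e.g. taken from the GET response's Content-Length header
        self._buffer = b''

    def __len__(self):
        # Lets the multipart encoder answer len(file) without consuming the stream.
        return self._length

    def read(self, size=-1):
        # Pull chunks from the generator until 'size' bytes are buffered or it is exhausted.
        while size < 0 or len(self._buffer) < size:
            try:
                self._buffer += next(self._iterator)
            except StopIteration:
                break
        if size < 0:
            data, self._buffer = self._buffer, b''
        else:
            data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data

An instance of this could then be passed in the files dict in place of a real file object.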

Edit #2

You might be able to make a generator of your own that produces the multipart/form-data encoded data yourself. You could pass it in the same way as you would a chunk-encoded request, but you'd have to make sure you set your own Content-Type and Content-Length headers. I don't have time to sketch an example, but it shouldn't be too difficult.
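
Something along these lines, perhaps (a sketch only; the boundary string and field name are invented, and Content-Length is omitted here, so requests would fall back to a chunked upload):

import requests

def multipart_body(fileobj, filename, boundary='xxBOUNDARYxx', chunk_size=64 * 1024):
    # Opening boundary plus the part headers for a single 'file' field.
    yield ('--%s\r\n'
           'Content-Disposition: form-data; name="file"; filename="%s"\r\n'
           'Content-Type: application/octet-stream\r\n\r\n' % (boundary, filename)).encode('utf-8')
    # Stream the file contents chunk by chunk.
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        yield chunk
    # Closing boundary.
    yield ('\r\n--%s--\r\n' % boundary).encode('utf-8')

response = requests.get(big_file_url, stream=True)
headers = {'Content-Type': 'multipart/form-data; boundary=xxBOUNDARYxx'}
post_response = requests.post(upload_url,
                              data=multipart_body(response.raw, 'filename'),
                              headers=headers)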

Awildaawkward answered 12/4, 2013 at 23:42 Comment(4)
iter_chunks throws an exception - did you mean iter_content? With the latter I get an error similar to what I mentioned in the comments above: TypeError: object of type 'generator' has no len() in the call to write – Exposed
I'm going to update the answer because I forgot something, sorry. (In short, you can not know the content length of that big file (I'm guessing), so you need to use chunked encoding to upload it.) – Awildaawkward
I can see why this would work, thanks. Unfortunately I do actually need the multipart/form-data, as I need to send it to the GAE blobstore handler, as you can see in the linked context question. – Exposed
Sorry, I hadn't seen the context question. I do have a different idea now, though. :) – Awildaawkward

In theory you can just use the raw object:

In [1]: import requests

In [2]: raw = requests.get("http://download.thinkbroadband.com/1GB.zip", stream=True).raw

In [3]: raw.read(10)
Out[3]: '\xff\xda\x18\x9f@\x8d\x04\xa11_'

In [4]: raw.read(10)
Out[4]: 'l\x15b\x8blVO\xe7\x84\xd8'

In [5]: raw.read() # takes forever...

In [6]: raw = requests.get("http://download.thinkbroadband.com/5MB.zip", stream=True).raw

In [7]: requests.post("http://www.amazon.com", files={'file': ('thing.zip', raw, 'application/zip')}, stream=True)
Out[7]: <Response [200]>
Penetration answered 12/4, 2016 at 8:1 Comment(0)
