urllib.request: any way to read from it without modifying the request object?

Given a standard urllib.request object, retrieved so:

req = urllib.urlopen('http://example.com')

If I read its contents via req.read(), afterwards the request object will be empty.

Unlike normal file-like objects, however, the request object does not have a seek method, for what I am sure are excellent reasons.

However, in my case I have a function, and I want it to make certain determinations about a request and then return that request "unharmed" so that it can be read again.

I understand that one option is to re-request it. But I'd like to be able to avoid making multiple HTTP requests for the same url & content.

The only other alternative I can think of is to have the function return a tuple of the extracted content and the request object, with the understanding that anything that calls this function will have to get the content in this way.
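Roughly, that alternative would look like the sketch below. The name `check_and_return` and the emptiness check are made up for illustration, and `io.BytesIO` stands in for the object returned by urlopen so the snippet runs without a network connection:

```python
import io


def check_and_return(response):
    """Hypothetical helper: read the body once and hand it back with the response."""
    body = response.read()  # this drains the response
    if not body:
        raise ValueError("empty response")
    # Callers must use `body` from now on; `response` has nothing left to read.
    return body, response


# io.BytesIO is only a stand-in for a real response object.
body, response = check_and_return(io.BytesIO(b"<html>hello</html>"))
leftover = response.read()
```

Every caller has to know to take the content from the tuple, which is exactly the coupling I would like to avoid.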

Is that my only option?

Salience answered 17/4, 2013 at 18:36 Comment(2)
Don't use urllib.urlopen - Also note that the urllib.urlopen() function has been removed in Python 3 in favor of urllib2.urlopen(). - Integument
Thanks for letting me know, although in this case the behavior from urllib2.urlopen is the same. - Salience

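Delegate the caching to an in-memory buffer object (code not tested, just to give the idea). The version below uses Python 3 names, and io.BytesIO rather than StringIO because read() returns bytes:

import urllib.request
from io import BytesIO


class CachedRequest(object):
    def __init__(self, url):
        # Python 3 spelling; in Python 2 this was urllib2.urlopen(url).
        self._request = urllib.request.urlopen(url)
        self._content = None

    def __getattr__(self, attr):
        # Only called for attributes not defined on CachedRequest;
        # forward those to the wrapped response object.
        return getattr(self._request, attr)

    def _buffer(self):
        # Download the body once and keep it in a seekable buffer.
        if self._content is None:
            self._content = BytesIO(self._request.read())
        return self._content

    def read(self, *args):
        return self._buffer().read(*args)

    def seek(self, *args):
        # Works even before the first read(), since _buffer() fills the cache.
        return self._buffer().seek(*args)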

If the calling code actually expects a real Request object (i.e. it calls isinstance to check the type), then subclass Request and you don't even have to implement __getattr__.

Note that it is possible that a function checks for the exact class (in which case you can't do anything) or, if it's written in C, calls the method through C API calls (in which case the overridden method won't be called).
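To try the idea without a network connection, here is a minimal sketch of a variant that wraps an already opened response instead of a URL. The names `CachedResponse` and `is_html` are hypothetical, and `io.BytesIO` is only a stand-in for whatever urlopen returns:

```python
import io


class CachedResponse:
    """Hypothetical wrapper: buffers a read-once response on first use."""

    def __init__(self, response):
        self._response = response
        self._content = None

    def __getattr__(self, attr):
        # Anything not defined here is forwarded to the wrapped object.
        return getattr(self._response, attr)

    def _buffer(self):
        # Drain the wrapped response once and keep the bytes in memory.
        if self._content is None:
            self._content = io.BytesIO(self._response.read())
        return self._content

    def read(self, *args):
        return self._buffer().read(*args)

    def seek(self, *args):
        return self._buffer().seek(*args)


def is_html(response):
    """Peek at the start of the body, then rewind so it can be read again."""
    head = response.read(64)
    response.seek(0)
    return b"<html" in head.lower()


# io.BytesIO stands in for the object returned by urlopen().
page = CachedResponse(io.BytesIO(b"<HTML><body>hello</body></HTML>"))
found = is_html(page)
body = page.read()  # still the full body, because is_html() rewound the buffer
```

The whole body is held in memory, so this is only reasonable for responses of modest size.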

Ona answered 17/4, 2013 at 18:47 Comment(2)
Wouldn't you need to set up self._content to be something like StringIO instead of None? Pretty sure you'd run into an AttributeError when calling write. - Salience
@JordanReiter Sorry. At the beginning I wrote self._content = StringIO(), then I changed my mind and forgot to fix the bit of code that assumed self._content was already initialized. - Ona

Make a subclass of urllib2.Request that uses a cStringIO.StringIO to hold whatever gets read. Then you can implement seek and so forth. Actually you could just use a string, but that'd be more work.
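A rough sketch of the "hold whatever gets read" idea, written for Python 3 with `io.BytesIO` in place of cStringIO. It is a plain wrapper rather than a subclass of urllib2.Request, and the class name `RewindableReader` and its `rewind` method are invented for illustration:

```python
import io


class RewindableReader:
    """Remember every byte read from `source` so it can be replayed later."""

    def __init__(self, source):
        self._source = source        # read-once file-like object
        self._cache = io.BytesIO()   # everything read from source so far

    def read(self, size=-1):
        # Serve as much as possible from the cache first.
        cached = self._cache.read(size)
        if size is None or size < 0:
            fresh = self._source.read()
        elif len(cached) < size:
            fresh = self._source.read(size - len(cached))
        else:
            fresh = b""
        # The cache position is at its end whenever fresh data is needed,
        # so this appends and leaves the position at the new end.
        self._cache.write(fresh)
        return cached + fresh

    def rewind(self):
        """Go back to the first byte that was ever read."""
        self._cache.seek(0)


# io.BytesIO is only a stand-in for a real response object.
reader = RewindableReader(io.BytesIO(b"hello world"))
first = reader.read(5)
reader.rewind()
everything = reader.read()
```

Unlike buffering the whole body up front, this only pulls from the underlying object when the cache runs out, so a caller that peeks at a few bytes does not force the full download.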

Lungan answered 17/4, 2013 at 18:45 Comment(0)
