How can I create a GzipFile instance from the “file-like object” that urllib.urlopen() returns?

I’m playing around with the Stack Overflow API using Python. I’m trying to decode the gzipped responses that the API gives.

import urllib, gzip

url = urllib.urlopen('http://api.stackoverflow.com/1.0/badges/name')
gzip.GzipFile(fileobj=url).read()

According to the urllib2 documentation, urlopen “returns a file-like object”.

However, when I run read() on the GzipFile object I’ve created using it, I get this error:

AttributeError: addinfourl instance has no attribute 'tell'

As far as I can tell, this is coming from the object returned by urlopen.

It doesn’t appear to have seek either, as I get an error when I do this:

url.read()
url.seek(0)

What exactly is this object, and how do I create a functioning GzipFile instance from it?

Mistrust answered 17/11, 2010 at 13:5 Comment(2)
Content-Encoding: gzip should be handled by the http library, but unfortunately it isn't. This is issue 9500 in Python's bug database, for the interested.Defloration
@Magnus: cheers, good to know it’s at least in the bug tracker.Mistrust

The urlopen docs list the methods supported by the object it returns. I recommend wrapping that object in another class that provides the methods gzip expects (notably seek and tell).
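For illustration, a minimal sketch of such a wrapper might look like this (SeekableResponse is a made-up name; since gzip seeks around in the underlying file object, the wrapper effectively has to buffer the whole body anyway):

import gzip
import urllib
from StringIO import StringIO

class SeekableResponse(object):
    """Buffers the body of an unseekable HTTP response so that the
    seek()/tell() calls gzip makes will work."""

    def __init__(self, response):
        # Simplest correct approach: read everything into memory up front
        self._buf = StringIO(response.read())

    def read(self, size=-1):
        return self._buf.read(size)

    def seek(self, offset, whence=0):
        return self._buf.seek(offset, whence)

    def tell(self):
        return self._buf.tell()

url = urllib.urlopen('http://api.stackoverflow.com/1.0/badges/name')
g = gzip.GzipFile(fileobj=SeekableResponse(url))
print g.read()[:100]  # first bytes of the decompressed JSON

In practice this just moves the buffering into the wrapper, so it ends up roughly equivalent to the second option.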

Another option: call the read method of the response object and put the result in a StringIO object (which supports all of the methods gzip expects). This may be a little more expensive, though.

E.g.

import gzip
import json
import StringIO
import urllib

url = urllib.urlopen('http://api.stackoverflow.com/1.0/badges/name')
# Read the whole gzipped response into a seekable in-memory buffer
url_f = StringIO.StringIO(url.read())
g = gzip.GzipFile(fileobj=url_f)
j = json.load(g)
Swoon answered 17/11, 2010 at 13:14 Comment(4)
Wrapping it in a StringIO object gets past that error, but I still get an IOError: Not a gzipped fileIceboat
@ThomasK It works fine for me. Are you passing url.read() to the StringIO constructor or just url? The latter fails.Tressatressia
Excellent, cheers. Unutbu’s answer was great too, but I’ll go with this one as I’m guessing the StringIO solution is more backwards compatible.Mistrust
Is there a way to do this without reading the entire urlopen response in one go? I'm looking to use something like this in a situation where the payload of the urlopen is very large (GBs), so I would like to be able to use this to stream-parse as data comes in, rather than blocking on the whole http request.Rental
import urllib2
import json
import gzip
import io

url = 'http://api.stackoverflow.com/1.0/badges/name'
page = urllib2.urlopen(url)
# Buffer the raw response in a seekable io.BytesIO before handing it to GzipFile
gzip_filehandle = gzip.GzipFile(fileobj=io.BytesIO(page.read()))
json_data = json.loads(gzip_filehandle.read())
print(json_data)

io.BytesIO is for Python 2.6+. For older versions of Python, you could use cStringIO.StringIO (sketched below).
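For example, a rough sketch of the same approach with cStringIO on an older Python 2 (json isn't in the standard library before 2.6, so the decompressed text is just printed here):

import gzip
import urllib2
import cStringIO

page = urllib2.urlopen('http://api.stackoverflow.com/1.0/badges/name')
# cStringIO.StringIO gives a seekable in-memory file over the raw gzipped bytes
buf = cStringIO.StringIO(page.read())
gzip_filehandle = gzip.GzipFile(fileobj=buf)
print gzip_filehandle.read()[:200]  # decompressed JSON text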

Chauvin answered 17/11, 2010 at 13:19 Comment(0)

Here is an update to @stefanw's answer, for anyone who thinks it too expensive to hold the whole response in memory.

Thanks to this article (https://www.enricozini.org/blog/2011/cazzeggio/python-gzip/, which explains why gzip doesn't work on the response directly in Python 2), the solution is to use Python 3.

import urllib.request
import gzip

response = urllib.request.urlopen('http://api.stackoverflow.com/1.0/badges/name')
# In Python 3, GzipFile can read from the response stream directly (no seek needed)
with gzip.GzipFile(fileobj=response) as f:
    for line in f:
        print(line)
Tessie answered 5/9, 2019 at 9:6 Comment(0)
