IncompleteRead using httplib
I have been having a persistent problem retrieving an RSS feed from a particular website. I wound up writing a rather ugly procedure to perform this task, but I am curious why this happens and whether any higher-level interfaces handle this problem properly. This isn't really a show stopper, since I don't need to retrieve the feed very often.

I have read about a solution that traps the exception and returns the partial content, but since the incomplete reads differ in the number of bytes actually retrieved, I have no certainty that such a solution will actually work.
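One sanity check I could apply to such a solution is to verify that whatever bytes do come back still parse as XML. A rough sketch of what I mean (the fetch_and_check helper is just for illustration):

import urllib2
import xml.etree.ElementTree as ET
from httplib import IncompleteRead

def fetch_and_check(url):
    response = urllib2.urlopen(url)
    try:
        data = response.read()
    except IncompleteRead as e:
        data = e.partial  # whatever arrived before the stream broke
    try:
        ET.fromstring(data)  # raises ParseError if the document is truncated
    except ET.ParseError:
        return None  # the payload really is incomplete
    return data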

#!/usr/bin/env python
import sys
import feedparser
from mechanize import Browser
import requests
import urllib2
from httplib import IncompleteRead

url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'

content = feedparser.parse(url)
if 'bozo_exception' in content:
    print content['bozo_exception']
else:
    print "Success!!"
    sys.exit(0)

print "If you see this, please tell me what happened."

# try using mechanize
b = Browser()
r = b.open(url)
try:
    r.read()
except IncompleteRead as e:
    print "IncompleteRead using mechanize", e

# try using urllib2
r = urllib2.urlopen(url)
try:
    r.read()
except IncompleteRead as e:
    print "IncompleteRead using urllib2", e


# try using requests
try:
    r = requests.request('GET', url)
except IncompleteRead as e:
    print "IncompleteRead using requests", e

# this function is old and I categorized it as ...
# "at least it works darnnit!", but I would really like to 
# learn what's happening.  Please help me put this function into
# eternal rest.
def get_rss_feed(url):
    response = urllib2.urlopen(url)
    content = ''
    while True:
        try:
            chunk = response.read(1)  # read one byte at a time; slow, but it works
        except IncompleteRead:
            break  # server ended the stream early; keep what we have
        if not chunk:  # normal end of stream, avoids spinning forever
            break
        content += chunk
    return content, response.info()


content, info = get_rss_feed(url)

feed = feedparser.parse(content)

As already stated, this isn't a mission-critical problem, just a curiosity. Even though I might expect urllib2 to have this problem, I am surprised that the error shows up in mechanize and requests as well. The feedparser module doesn't even raise an error, so checking for failure depends on the presence of a 'bozo_exception' key.
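For what it's worth, the idiomatic check appears to be feedparser's bozo flag rather than testing for the key; a minimal example:

import feedparser

d = feedparser.parse(url)  # url as defined above
if d.bozo:  # set to 1 whenever feedparser hit a problem
    print 'feed error:', d.bozo_exception
else:
    print 'parsed %d entries' % len(d.entries)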

Edit: I just wanted to mention that both wget and curl perform this retrieval flawlessly, fetching the full payload correctly every time. I have yet to find a pure Python method that works, apart from my ugly hack, and I am very curious to know what is happening on the back end of httplib. On a lark, I decided the other day to try this with twill as well, and got the same httplib error.

P.S. There is one thing that also strikes me as very odd. The IncompleteRead happens consistently at one of two breakpoints in the payload: feedparser and requests fail after reading 926 bytes, while mechanize and urllib2 fail after reading 1854 bytes. This behavior is consistent, and I am left without explanation or understanding.

Retractor answered 3/1, 2013 at 23:43

At the end of the day, all of the other modules (feedparser, mechanize, and urllib2) call down into httplib, which is where the exception is being raised.
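The failure can be reproduced with httplib directly; a quick sketch against the same URL:

import httplib

conn = httplib.HTTPConnection('hattiesburg.legistar.com')
conn.request('GET', '/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)')
response = conn.getresponse()
try:
    body = response.read()
except httplib.IncompleteRead as e:
    print 'httplib raises it as well:', e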

Now, first things first, I also downloaded this with wget and the resulting file was 1854 bytes. Next, I tried with urllib2:

>>> import urllib2
>>> url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'
>>> f = urllib2.urlopen(url)
>>> f.headers.headers
['Cache-Control: private\r\n',
 'Content-Type: text/xml; charset=utf-8\r\n',
 'Server: Microsoft-IIS/7.5\r\n',
 'X-AspNet-Version: 4.0.30319\r\n',
 'X-Powered-By: ASP.NET\r\n',
 'Date: Mon, 07 Jan 2013 23:21:51 GMT\r\n',
 'Via: 1.1 BC1-ACLD\r\n',
 'Transfer-Encoding: chunked\r\n',
 'Connection: close\r\n']
>>> f.read()
< Full traceback cut >
IncompleteRead: IncompleteRead(1854 bytes read)

So it is reading all 1854 bytes but then thinks there is more to come. If we explicitly tell it to read only 1854 bytes, it works:

>>> f = urllib2.urlopen(url)
>>> f.read(1854)
'\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">...snip...</rss>'

Obviously, this is only useful if we always know the exact length ahead of time. We can use the fact that the partial read is returned as an attribute on the exception to capture the entire contents:

>>> import httplib
>>> f = urllib2.urlopen(url)
>>> try:
...     contents = f.read()
... except httplib.IncompleteRead as e:
...     contents = e.partial
...
>>> contents
'\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">...snip...</rss>'
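Building on that, the byte-at-a-time loop in the question could be replaced with a chunked reader that keeps the partial data; this is just a sketch of the same try/except idea (the read_all name is mine):

import urllib2
from httplib import IncompleteRead

def read_all(url, block_size=4096):
    response = urllib2.urlopen(url)
    chunks = []
    while True:
        try:
            data = response.read(block_size)
        except IncompleteRead as e:
            chunks.append(e.partial)  # keep what arrived in the failed read
            break
        if not data:  # normal end of stream
            break
        chunks.append(data)
    return ''.join(chunks)

The result can then be handed to feedparser.parse(), which accepts a string as well as a URL.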

This blog post suggests this is a fault of the server, and describes how to monkey-patch the httplib.HTTPResponse.read() method with the try..except block above to handle things behind the scenes:

import httplib

def patch_http_response_read(func):
    def inner(*args):
        try:
            return func(*args)
        except httplib.IncompleteRead as e:
            return e.partial  # swallow the error and return what arrived

    return inner

httplib.HTTPResponse.read = patch_http_response_read(httplib.HTTPResponse.read)

I applied the patch and then feedparser worked:

>>> import feedparser
>>> url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'
>>> feedparser.parse(url)
{'bozo': 0,
 'encoding': 'utf-8',
 'entries': ...
 'status': 200,
 'version': 'rss20'}

This isn't the nicest way of doing things, but it seems to work. I'm not expert enough in the HTTP protocol to say for sure whether the server is doing something wrong or whether httplib is mishandling an edge case.

Mika answered 7/1, 2013 at 23:41
While I agree that it's not a nice way of doing things, it's certainly much better than the method I was using. (I really need to practice using decorators more often.) I'm not an expert in the HTTP protocol either, nor in whether httplib is handling this properly, which is why I felt this might be a good question to ask. FWIW, every other page on this site works well; it is only when accessing the RSS URLs that this problem occurs on their HTTP server. - Retractor
@Retractor - maybe it's something to do with the content type of the response, i.e., the way the server is configured: text/html responses work fine but text/xml don't? If no more comprehensive answers show up, you could always try posting this question to the Python mailing list and see if anybody there can give a diagnosis. - Mika

I found that in my case, sending an HTTP/1.0 request fixed the problem. I just added this to the code:

import httplib
# Force every subsequent HTTPConnection to speak HTTP/1.0,
# which avoids chunked transfer encoding.
httplib.HTTPConnection._http_vsn = 10
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'

Then I do the request:

# post and headers are defined elsewhere in my code
req = urllib2.Request(url, post, headers)
filedescriptor = urllib2.urlopen(req)
img = filedescriptor.read()

Afterwards I switch back to HTTP/1.1 (for connections that support 1.1):

httplib.HTTPConnection._http_vsn = 11
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.1'
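Since those flags are module-level and affect every HTTPConnection in the process, wrapping the switch in try/finally so the defaults are always restored might be safer; a sketch (the urlopen_http10 helper is my own naming):

import httplib
import urllib2

def urlopen_http10(url):
    # Remember the current protocol version, force HTTP/1.0 for this
    # request, and restore the old settings no matter what happens.
    old_vsn = httplib.HTTPConnection._http_vsn
    old_vsn_str = httplib.HTTPConnection._http_vsn_str
    httplib.HTTPConnection._http_vsn = 10
    httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'
    try:
        return urllib2.urlopen(url).read()
    finally:
        httplib.HTTPConnection._http_vsn = old_vsn
        httplib.HTTPConnection._http_vsn_str = old_vsn_str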
Armlet answered 17/12, 2013 at 22:14
Worked for me too, thanks a lot! Do you have any idea why this is happening? What's so special about 1.0 for incomplete reads? - October
You force the old protocol version, which gives up HTTP/1.1 capabilities such as reading in chunks; this tends to come up when you try to download larger files. - Finished
Not all servers accept HTTP/1.0 - I am getting a 404 from one of them. - Knighton

I fixed the issue by using HTTPS instead of HTTP, and it works fine. No other code change was required.
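If the host does serve the feed over HTTPS, the change can be as small as swapping the scheme; a tiny sketch:

import feedparser

# Assumes the server actually answers on HTTPS.
feed_url = url.replace('http://', 'https://', 1)  # url is the original feed URL
content = feedparser.parse(feed_url)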

Flute answered 3/3, 2019 at 19:55
