Overriding urllib2.HTTPError or urllib.error.HTTPError and reading response HTML anyway

I receive an 'HTTP Error 500: Internal Server Error' response, but I still want to read the data inside the error page's HTML.

With Python 2.6, I normally fetch a page using:

import urllib2
url = "http://google.com"
data = urllib2.urlopen(url)
data = data.read()

When attempting to use this on the failing URL, I get the exception urllib2.HTTPError:

urllib2.HTTPError: HTTP Error 500: Internal Server Error

How can I fetch such error pages (with or without urllib2) while they are returning Internal Server Errors?

Note that with Python 3, the corresponding exception is urllib.error.HTTPError.

Octans answered 10/2, 2010 at 0:55 Comment(0)

The HTTPError is a file-like object. You can catch it and then read its contents.

try:
    resp = urllib2.urlopen(url)
    contents = resp.read()
except urllib2.HTTPError, error:
    contents = error.read()
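
For Python 3, the same technique works with the urllib.request / urllib.error names the question mentions. The sketch below spins up a throwaway local server that always answers 500 (the server is purely illustrative, not part of the answer's technique) so the behavior can be verified end to end:

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class FailingHandler(BaseHTTPRequestHandler):
    """Toy handler that always answers 500 with an HTML body."""
    def do_GET(self):
        body = b"<html>custom 500 page</html>"
        self.send_response(500)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet

server = HTTPServer(("127.0.0.1", 0), FailingHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

try:
    resp = urllib.request.urlopen(url)
    contents = resp.read()
except urllib.error.HTTPError as error:
    contents = error.read()  # the error page body, despite the 500

server.shutdown()
```

Note that Python 3 requires the `except ... as error:` form; the comma syntax used above is Python 2 only.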
Hrutkay answered 10/2, 2010 at 1:18 Comment(4)
Once we have called error.read(), subsequent calls return an empty string. Sometimes this messes up code elsewhere. How can we politely put the contents of the error back for others?Peradventure
@Matt I've never tried this, but since it's a file-like object, you might be able to do an error.seek(0) to reset the "file pointer" to the beginning of the stream. Not every file-like object is required to implement the random-access portion of the I/O interface, so I'm not sure whether it works. If it doesn't, you might consider asking this in its own question so you'll tap a bigger audience.Hrutkay
Note that in degenerate cases HTTPError may not behave as a file-like object. Verify that read() is available with hasattr.Drucilladrucy
Since the underlying stream is an http response, it is non-seekable, meaning that you cannot call seek() on it.Ance
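A workaround for the single-read problem discussed in these comments (a sketch; `cache_error_body` is a made-up helper, not part of urllib): read the body once and wrap the bytes in `io.BytesIO`, which is seekable and can be handed to code that expects a file-like object.

```python
import io

def cache_error_body(error):
    """Drain a file-like error object once; return a seekable replacement."""
    payload = error.read()      # consumes the non-seekable HTTP stream
    return io.BytesIO(payload)  # re-readable: supports seek(0)

# Stand-in for a real HTTPError, which is also just a file-like object:
fake_error = io.BytesIO(b"<html>oops</html>")
body = cache_error_body(fake_error)
first = body.read()
body.seek(0)
second = body.read()  # same bytes again
```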

If you mean you want to read the body of the 500:

request = urllib2.Request(url, data, headers)
try:
    resp = urllib2.urlopen(request)
    print resp.read()
except urllib2.HTTPError, error:
    print "ERROR: ", error.read()

In your case, you don't need to build up the request. Just do

try:
    resp = urllib2.urlopen(url)
    print resp.read()
except urllib2.HTTPError, error:
    print "ERROR: ", error.read()

So you don't override urllib2.HTTPError; you just handle the exception.

Hume answered 10/2, 2010 at 0:59 Comment(1)
No, I want to read the HTML the server would send to a user's browser if they accidentally went to one of the 500-error pages. Likewise, if urllib breaks on a 404 page (I'm not sure whether it does; I haven't tried), I want to read the HTML that the 404 page provides (e.g. if the site serves a custom 404 page).Octans
import urllib2

alist = ['http://someurl.com']

def testUrl():
    errList = []
    for URL in alist:
        try:
            urllib2.urlopen(URL)
        except urllib2.URLError, err:
            errList.append(URL + " " + str(err.reason))
    return "\n".join(errList)

print testUrl()
Stalder answered 10/4, 2016 at 11:40 Comment(2)
you should add descriptive text to your answerNitroglycerin
err.reason does not actually provide the same info that err.read() provides. The latter can be more specifically useful.Casiecasilda
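
To see the difference the last comment describes, one can construct an HTTPError by hand (a sketch; real instances come from urlopen), here using the Python 3 names:

```python
import io
import urllib.error

# Hand-built HTTPError: url, code, msg, hdrs, and a file-like body.
body = b"<html>detailed error page</html>"
err = urllib.error.HTTPError(
    "http://someurl.com", 500, "Internal Server Error",
    None, io.BytesIO(body))

reason = str(err.reason)  # only the status phrase
contents = err.read()     # the full HTML error body
```

err.reason yields just "Internal Server Error", while err.read() returns the whole error page.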

© 2022 - 2024 — McMap. All rights reserved.