urllib2.urlopen() vs urllib.urlopen() - urllib2 throws 404 while urllib works! WHY?

import urllib

print urllib.urlopen('http://www.reefgeek.com/equipment/Controllers_&_Monitors/Neptune_Systems_AquaController/Apex_Controller_&_Accessories/').read()

The above script works and returns the expected results while:

import urllib2

print urllib2.urlopen('http://www.reefgeek.com/equipment/Controllers_&_Monitors/Neptune_Systems_AquaController/Apex_Controller_&_Accessories/').read()

throws the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/urllib2.py", line 124, in urlopen
    return _opener.open(url, data)
  File "/usr/lib/python2.5/urllib2.py", line 387, in open
    response = meth(req, response)
  File "/usr/lib/python2.5/urllib2.py", line 498, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.5/urllib2.py", line 425, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.5/urllib2.py", line 360, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.5/urllib2.py", line 506, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: Not Found

Does anyone know why this is? I'm running this from a laptop on my home network with no proxy settings: straight from the laptop to the router and then out to the web.

Tralee answered 22/12, 2009 at 15:34 Comment(0)

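That URL does indeed result in a 404, but with lots of HTML content. urllib2 is handling it (correctly) as an error condition and raises HTTPError, whereas urllib.urlopen() does not raise on HTTP error statuses and simply hands back the body. You can recover the content of that site's 404 page like so: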

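import urllib2
try:
    print urllib2.urlopen('http://www.reefgeek.com/equipment/Controllers_&_Monitors/Neptune_Systems_AquaController/Apex_Controller_&_Accessories/').read()
except urllib2.HTTPError, e:
    # HTTPError doubles as a response object, so the 404 page is still available.
    print e.code       # numeric status code (404 here)
    print e.msg        # reason phrase from the status line
    print e.headers    # response headers
    print e.fp.read()  # body of the error page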
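
For anyone on Python 3, where urllib2 became urllib.request and urllib.error, here is a rough sketch of the same idea. It does not touch the site from the question; instead it starts a throwaway local server that answers every request with a 404 plus an HTML body, which is my assumption about what the real site is doing. The names NotFoundWithBody, fetch and demo are made up for this illustration.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class NotFoundWithBody(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET with a 404 that still carries HTML."""

    def do_GET(self):
        body = b"<html><body>Full page content, sent with status 404</body></html>"
        self.send_response(404)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch(url, opener=None):
    """Return (status, body), reading the body even when the status is an error."""
    open_url = opener.open if opener is not None else urllib.request.urlopen
    try:
        with open_url(url) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as e:
        # HTTPError is also a file-like response, so the error page can be read.
        return e.code, e.read()


def demo():
    # Port 0 asks the OS for any free port; the server exists only for this demo.
    server = HTTPServer(("127.0.0.1", 0), NotFoundWithBody)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # An empty ProxyHandler keeps environment proxy settings out of the local test.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        return fetch("http://127.0.0.1:%d/any/path/" % server.server_port, opener)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    status, body = demo()
    print(status)
    print(body.decode("utf-8"))
```

Calling fetch(url) without an opener falls back to plain urllib.request.urlopen, so the same helper should work against a real URL. Returning the status alongside the body lets the caller decide whether a 404 page with content is good enough.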
Emancipation answered 22/12, 2009 at 15:50 Comment(4)
That's good to know. Out of curiosity, when I type this URL into my browser, it also works. Does this mean that the browser is also receiving a 404 but just displaying the content like urllib does? - Tralee
@Jerry Yes, that's what this means. You can verify this with Firebug or Safari/Chrome's Web Inspector. - Hehre
I have Firebug and I had checked it, but I didn't see anything that indicated a 404. Is there something special you have to do? Out of morbid curiosity, why do the browsers tolerate such poor standards? Why not just indicate that the file couldn't be found? Is this some type of trick the site is using to thwart bots: return a 404 with content, knowing that browsers will display the content and most bots will move on? - Tralee
It's returning 404 because they have a bug in their web site, I think. A 404 can have whatever content you wish. A legitimate 404, for example, might return a site directory or the results of a text search related to the URL you typed. The browsers are doing what they're supposed to do. - Emancipation
