Why am I able to read data from a HEAD HTTP request in Python 3 urllib.request?

I want to make a HEAD request, without any content data, to conserve bandwidth. I'm using urllib.request. However, upon testing, it appears the HEAD request also gets the data. What's going on?

Python 3.4.2 (v3.4.2:ab2c023a9432, Oct  6 2014, 22:16:31) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.request
>>> req = urllib.request.Request("http://www.google.com", method="HEAD")
>>> resp = urllib.request.urlopen(req)
>>> a = resp.read()
>>> len(a)
24088
Mineral answered 29/3, 2015 at 9:44 Comment(4)
btw, I know other modules exist, but I'd rather understand what's going on here. – Mineral
What is the response content? – Compressive
It's an HTML page. I decided against pasting a 24 KB HTML page here. – Mineral
For what it's worth, Python 3.8 seems to correctly apply the HEAD method to the redirected request. – Effulgence

The http://www.google.com URL redirects:

$ curl -D - -X HEAD http://www.google.com
HTTP/1.1 302 Found
Cache-Control: private
Content-Type: text/html; charset=UTF-8
Location: http://www.google.co.uk/?gfe_rd=cr&ei=A8sXVZLOGvHH8ge1jYKwDQ
Content-Length: 261
Date: Sun, 29 Mar 2015 09:50:59 GMT
Server: GFE/2.0
Alternate-Protocol: 80:quic,p=0.5

and urllib.request has followed the redirect, issuing a GET request to that new location:

>>> import urllib.request
>>> req = urllib.request.Request("http://www.google.com", method="HEAD")
>>> resp = urllib.request.urlopen(req)
>>> resp.url
'http://www.google.co.uk/?gfe_rd=cr&ei=ucoXVdfaJOTH8gf-voKwBw'
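
The same check can be made without depending on Google's redirect. The following is a minimal sketch, assuming a throwaway local server that answers /start with a 302 to /final and records the method of every request it receives. The names RecordingHandler and record_methods and both paths are made up for illustration; only the standard library is used.

```python
import http.server
import threading
import urllib.request

seen = []  # (method, path) pairs recorded by the local server


class RecordingHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical test server: /start redirects to /final."""

    def _handle(self):
        seen.append((self.command, self.path))
        if self.path == "/start":
            self.send_response(302)
            self.send_header("Location", "/final")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"hello"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            if self.command != "HEAD":  # a HEAD response carries no body
                self.wfile.write(body)

    do_GET = do_HEAD = _handle

    def log_message(self, *args):
        pass  # keep the console quiet


def record_methods():
    """Send one HEAD request to /start and return what the server saw."""
    seen.clear()
    server = http.server.HTTPServer(("127.0.0.1", 0), RecordingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # An empty ProxyHandler keeps the request away from any configured proxy.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        url = "http://127.0.0.1:%d/start" % server.server_port
        req = urllib.request.Request(url, method="HEAD")
        with opener.open(req) as resp:
            body = resp.read()
        return list(seen), len(body)
    finally:
        server.shutdown()
        server.server_close()
```

Calling record_methods() returns the list of (method, path) pairs plus the number of body bytes read. The first entry is the HEAD to /start; the method of the second entry shows what the interpreter in use sends after the redirect, which may differ between Python versions.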

You'd have to build your own handler stack to prevent this; HTTPRedirectHandler follows the redirect regardless of the request method and, in this Python version, builds the follow-up request without carrying over the HEAD method, so it goes out as a GET. Adapting Alan Duan's example from "How do I prevent Python's urllib(2) from following a redirect" to Python 3 would give you:

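import urllib.request

class NoRedirection(urllib.request.HTTPErrorProcessor):
    # Return every response as-is. The stock HTTPErrorProcessor hands non-2xx
    # responses to the error handlers, which is how a 302 reaches
    # HTTPRedirectHandler; skipping that step leaves the redirect unfollowed.
    # Side effect: 4xx/5xx responses are returned too, not raised as HTTPError.
    def http_response(self, request, response):
        return response
    https_response = http_response

opener = urllib.request.build_opener(NoRedirection)

req = urllib.request.Request("http://www.google.com", method="HEAD")
resp = opener.open(req)  # resp is the 302 itself; no follow-up request is sent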
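
If you would rather keep following redirects but stay on HEAD, another option is to customise the redirect handler instead of disabling it. This is a rough sketch, not something the standard library ships: the class name HeadPreservingRedirectHandler is made up, and it relies on redirect_request() being the documented hook that builds the follow-up request. Newer Python versions may already behave this way, in which case the override changes nothing.

```python
import urllib.request


class HeadPreservingRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Hypothetical handler: reuse HEAD for the redirected request."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Let the stock handler validate the redirect and build the new request.
        new_req = super().redirect_request(req, fp, code, msg, headers, newurl)
        # Carry the HEAD method over if the original request used it.
        if new_req is not None and req.get_method() == "HEAD":
            new_req.method = "HEAD"
        return new_req


# build_opener() swaps the default HTTPRedirectHandler for the subclass.
opener = urllib.request.build_opener(HeadPreservingRedirectHandler)

# Usage (performs network I/O, so it is left commented out here):
# req = urllib.request.Request("http://www.google.com", method="HEAD")
# resp = opener.open(req)
# print(resp.url, len(resp.read()))
```

With this opener the final response should come from a HEAD request, so resp.read() is expected to return no body, while resp.url still reports the redirected location.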

You'd be better off using the requests library; requests.head() and requests.Session().head() explicitly default to allow_redirects=False, so there you see the original response:

>>> import requests
>>> requests.head('http://www.google.com')
<Response [302]>
>>> _.headers['Location']
'http://www.google.co.uk/?gfe_rd=cr&ei=FcwXVbepMvHH8ge1jYKwDQ'

Even if redirection is enabled, the response.history list gives you access to the intermediate responses, and requests uses the correct method for the redirected call too:

>>> response = requests.head('http://www.google.com', allow_redirects=True)
>>> response.url
'http://www.google.co.uk/?gfe_rd=cr&ei=8e0XVYfGMubH8gfJnoKoDQ'
>>> response.history
[<Response [302]>]
>>> response.history[0].url
'http://www.google.com/'
>>> response.request.method
'HEAD'
Rainarainah answered 29/3, 2015 at 9:51 Comment(3)
Oh... Well, that answers it. Is it possible to follow with another HEAD request instead of a GET? – Mineral
@eric have a look at customising a docs.python.org/3.0/library/… – Tenney
@Eric: the answers from "How do I prevent Python's urllib(2) from following a redirect" should be readily adaptable to Python 3. Disable 302 handling, then request the HEAD without a redirect. requests is far more useful here in that it lets you control redirection behaviour with a single flag and gives you a response.history list on the final response if you did allow redirection. – Rainarainah
