How to un-shorten (resolve) a URL using Python when the final URL is https?

I am looking to unshorten (resolve) a URL in Python when the final URL is https. I have seen the question "How can I un-shorten a URL using python?" (as well as similar ones); however, as noted in a comment on the accepted answer, that solution only works when the URL does not redirect to https.

For reference, the code in that question (which works fine when redirecting to http urls) is:

# This is for Py2k.  For Py3k, use http.client and urllib.parse instead, and
# use // instead of / for the division
import httplib
import urlparse

def unshorten_url(url):
    parsed = urlparse.urlparse(url)
    h = httplib.HTTPConnection(parsed.netloc)
    resource = parsed.path
    if parsed.query != "":
        resource += "?" + parsed.query
    h.request('HEAD', resource)
    response = h.getresponse()
    if response.status/100 == 3 and response.getheader('Location'):
        return unshorten_url(response.getheader('Location'))  # changed to process chains of short urls
    else:
        return url

(Note: for bandwidth reasons, I want to do this by requesting only the headers, i.e. with a HEAD request like the http-only version above, and not by downloading the content of the whole pages.)

Loquitur answered 3/4, 2015 at 2:24 Comment(0)

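You can get the scheme from the parsed URL and use HTTPSConnection instead of HTTPConnection when parsed.scheme is https.
You can also do this very simply with the requests library, which follows redirects for a HEAD request when allow_redirects=True is passed: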

>>> import requests
>>> r = requests.head('http://bit.ly/IFHzvO', allow_redirects=True)
>>> print(r.url)
https://www.google.com
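
For the first suggestion (picking the connection class from the scheme), a minimal Python 3 sketch could look like the following. It is not the original answerer's code: the helper name connection_for, the max_redirects cap, the timeout, and the urljoin handling of relative Location headers are all assumptions added for illustration.

```python
import http.client
from urllib.parse import urljoin, urlparse


def connection_for(parsed, timeout=10):
    """Pick the connection class from the URL scheme (hypothetical helper)."""
    if parsed.scheme == "https":
        return http.client.HTTPSConnection(parsed.netloc, timeout=timeout)
    return http.client.HTTPConnection(parsed.netloc, timeout=timeout)


def unshorten_url(url, max_redirects=10):
    """Follow redirects using HEAD requests only and return the last URL reached."""
    for _ in range(max_redirects):
        parsed = urlparse(url)
        conn = connection_for(parsed)
        resource = parsed.path or "/"
        if parsed.query:
            resource += "?" + parsed.query
        try:
            conn.request("HEAD", resource)
            response = conn.getresponse()
            status = response.status
            location = response.getheader("Location")
        finally:
            conn.close()
        if status // 100 == 3 and location:
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, location)
        else:
            return url
    # Assumption: stop after max_redirects hops instead of looping forever.
    return url
```

Calling unshorten_url('http://bit.ly/IFHzvO') would then issue one HEAD request per hop, switching to HTTPSConnection as soon as a redirect points at an https URL. Replacing the recursion with a loop and a hop limit is a design choice of this sketch, meant to guard against redirect cycles.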
Bozuwa answered 3/4, 2015 at 2:48 Comment(1)
Thanks. I had to add the option verify=False to the request because of SSL errors that occurred whenever redirecting between different https domains (I am aware of the dangers of not verifying SSL certificates). - Loquitur
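
If you use the standard library instead of requests, a rough equivalent of verify=False is to pass an SSL context with verification switched off to HTTPSConnection. This is only a sketch: make_ssl_context is a hypothetical helper name and example.com is a placeholder host. Leaving verification on remains the safer default.

```python
import http.client
import ssl


def make_ssl_context(verify=True):
    """Build an SSL context; verify=False roughly mirrors requests' verify=False."""
    context = ssl.create_default_context()
    if not verify:
        # Order matters: check_hostname must be disabled before setting CERT_NONE.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context


# Hypothetical usage: nothing is sent until request() is called on the connection.
conn = http.client.HTTPSConnection("example.com", context=make_ssl_context(verify=False))
```

With verification disabled the connection accepts any certificate, so this is best limited to cases where only the redirect target is of interest and no sensitive data is sent.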
