How can I unshorten a URL?
I want to be able to take a shortened or non-shortened URL and return its un-shortened form. How can I write a Python program to do this?

Additional Clarification:

  • Case 1: shortened --> unshortened
  • Case 2: unshortened --> unshortened

e.g. bit.ly/silly in the input array should be google.com in the output array
e.g. google.com in the input array should be google.com in the output array

Sandman answered 17/11, 2010 at 2:56 Comment(2)
Are you talking about a specific URL shortening service, and does this service have an API you can retrieve the info from?Delaminate
If you are in a hurry, you could also use this API rapidapi.com/logicione/api/url-expander1Aggy
40

Send an HTTP HEAD request to the URL and look at the response code. If the code is 30x, look at the Location header to get the unshortened URL. Otherwise, if the code is 20x, then the URL is not redirected; you probably also want to handle error codes (4xx and 5xx) in some fashion. For example:

# This is for Py2k.  For Py3k, use http.client and urllib.parse instead, and
# use // instead of / for the division
import httplib
import urlparse

def unshorten_url(url):
    parsed = urlparse.urlparse(url)
    h = httplib.HTTPConnection(parsed.netloc)
    h.request('HEAD', parsed.path)
    response = h.getresponse()
    if response.status/100 == 3 and response.getheader('Location'):
        return response.getheader('Location')
    else:
        return url
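A Python 3 translation of the same approach, as the code comment suggests — a sketch using `http.client` and `urllib.parse` that also preserves the query string (which the minimal version above drops) and picks HTTPS when the scheme calls for it:

```python
import http.client
import urllib.parse

def head_path(parsed):
    """Build the HEAD request target, keeping any query string."""
    path = parsed.path or '/'
    if parsed.query:
        path += '?' + parsed.query
    return path

def unshorten_url(url):
    parsed = urllib.parse.urlparse(url)
    conn_cls = (http.client.HTTPSConnection if parsed.scheme == 'https'
                else http.client.HTTPConnection)
    conn = conn_cls(parsed.netloc)
    conn.request('HEAD', head_path(parsed))
    response = conn.getresponse()
    # A 3xx status with a Location header means the URL redirects elsewhere
    if response.status // 100 == 3 and response.getheader('Location'):
        return response.getheader('Location')
    return url
```

Calling `unshorten_url` once resolves a single hop; wrap it in a loop if the shortener chains redirects.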
Bus answered 17/11, 2010 at 3:20 Comment(3)
ignores url query, better version here: https://mcmap.net/q/584051/-how-can-i-un-shorten-a-url-using-pythonTurfman
do note when using above code does not unshorten recursively in case you want to obtain the actual URL. Try on http://t.co/hAplNMmSTg. You need to do return unshorten_url(response.getheader('Location')) for recursivity.Herculaneum
Possibly also keep track of previous urls in a set to prevent cyclic recursion.Purify
34

Using requests:

import requests

session = requests.Session()  # so connections are recycled
resp = session.head(url, allow_redirects=True)
print(resp.url)
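`requests` also records the intermediate hops, so the whole redirect chain can be inspected — a sketch (the function name `redirect_chain` is my own, and a timeout is added so a dead shortener cannot hang the call):

```python
import requests

def redirect_chain(url, timeout=10):
    """Return (final_url, [intermediate URLs]) for a possibly-shortened URL."""
    session = requests.Session()
    resp = session.head(url, allow_redirects=True, timeout=timeout)
    # resp.history holds the intermediate 30x responses, oldest first;
    # it is empty when no redirect occurred
    return resp.url, [hop.url for hop in resp.history]

if __name__ == "__main__":
    final, hops = redirect_chain("http://bit.ly/silly")
    for u in hops:
        print("via:", u)
    print("final:", final)
```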
Philter answered 7/3, 2015 at 18:0 Comment(3)
I like this solution, it automatically follows multiple redirectsGooding
I had to set verify=False as Requests could not validate the certDemers
Is there a way to have requests display the url of each redirect?Tuberose
5

Unshorten.me has an API that lets you send a JSON or XML request and get the full URL returned.

Heder answered 17/11, 2010 at 3:0 Comment(0)
5

If you are using Python 3.5+ you can use the Unshortenit module that makes this very easy:

from unshortenit import UnshortenIt
unshortener = UnshortenIt()
uri = unshortener.unshorten('https://href.li/?https://example.com')
Disparity answered 4/5, 2020 at 7:51 Comment(0)
4

Open the url and see what it resolves to:

>>> import urllib2
>>> a = urllib2.urlopen('http://bit.ly/cXEInp')
>>> print a.url
http://www.flickr.com/photos/26432908@N00/346615997/sizes/l/
>>> a = urllib2.urlopen('http://google.com')
>>> print a.url
http://www.google.com/
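On Python 3, `urllib2` became `urllib.request`, and issuing a HEAD request avoids downloading the whole page just to learn the final address — a sketch (`resolve_url` is a hypothetical helper name, and it assumes the server answers HEAD requests like GET):

```python
import urllib.request

def resolve_url(url):
    """Follow redirects with a HEAD request and return the final URL."""
    req = urllib.request.Request(url, method="HEAD")
    # urlopen follows redirects automatically; no body is fetched for HEAD
    with urllib.request.urlopen(req) as resp:
        return resp.geturl()

if __name__ == "__main__":
    print(resolve_url("http://bit.ly/cXEInp"))
```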
Lingulate answered 17/11, 2010 at 3:19 Comment(3)
This does a GET of the whole page. If the page isn't a redirect and happens to be very large, you're wasting a huge amount of bandwidth just to determine that it's not a redirect. Much better to use a HEAD request instead.Bus
@Adam Rosenfeld: It's probably an appropriate answer for a side project for someone beginning python. I don't recommend that Google or Yahoo spider pages like this to find the real URL.Lingulate
Doing this is NOT a good idea; you waste a lot of bandwidth. Just using the unshort.me API is better and faster, as @Heder suggestedMenstruate
4

To unshorten a URL, you can use requests. This is a simple solution that works for me.

import requests
url = "http://foo.com"

site = requests.get(url)
print(site.url)
Promethium answered 1/5, 2017 at 0:3 Comment(0)
1

http://github.com/stef/urlclean

sudo pip install urlclean
urlclean.unshorten(url)
Turfman answered 12/7, 2013 at 13:34 Comment(1)
Unfortunately this is Python 2 only, and why would one write unparenthesized prints in Python code in 2012 :(Purify
1

Here is source code that handles most of the useful corner cases:

  • sets a custom timeout.
  • sets a custom User-Agent.
  • checks whether to use an HTTP or HTTPS connection.
  • resolves the input URL recursively and guards against ending up in a loop.

The src code is on github @ https://github.com/amirkrifa/UnShortenUrl

comments are welcome ...

import logging
logging.basicConfig(level=logging.DEBUG)

TIMEOUT = 10
class UnShortenUrl:
    def process(self, url, previous_url=None):
        logging.info('Init url: %s'%url)
        import urlparse
        import httplib
        try:
            parsed = urlparse.urlparse(url)
            if parsed.scheme == 'https':
                h = httplib.HTTPSConnection(parsed.netloc, timeout=TIMEOUT)
            else:
                h = httplib.HTTPConnection(parsed.netloc, timeout=TIMEOUT)
            resource = parsed.path
            if parsed.query != "": 
                resource += "?" + parsed.query
            try:
                h.request('HEAD',
                          resource,
                          headers={'User-Agent': 'curl/7.38.0'})
                response = h.getresponse()
            except:
                import traceback
                traceback.print_exc()
                return url

            logging.info('Response status: %d'%response.status)
            if response.status/100 == 3 and response.getheader('Location'):
                red_url = response.getheader('Location')
                logging.info('Red, previous: %s, %s'%(red_url, previous_url))
                if red_url == previous_url:
                    return red_url
                return self.process(red_url, previous_url=url) 
            else:
                return url 
        except:
            import traceback
            traceback.print_exc()
            return None
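The `previous_url` guard above only catches a URL that redirects to itself; a longer loop (A points to B, B points back to A) would still recurse forever. A hedged sketch of a visited set plus a hop cap, with the actual HEAD lookup abstracted into a `resolve` callable (a hypothetical stand-in for the `httplib` code above):

```python
def follow_redirects(url, resolve, max_hops=10):
    """Follow a chain of redirects safely.

    `resolve(url)` should return the Location header for a 30x
    response, or None when the URL does not redirect.
    """
    seen = {url}
    for _ in range(max_hops):
        nxt = resolve(url)
        if nxt is None:        # not a redirect: done
            return url
        if nxt in seen:        # cycle detected: stop here
            return nxt
        seen.add(nxt)
        url = nxt
    return url                 # hop cap reached
```

With a fake resolver such as `lambda u: {'a': 'b', 'b': 'a'}.get(u)`, the call terminates as soon as the cycle closes instead of recursing forever.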
Pinpoint answered 15/7, 2015 at 21:22 Comment(3)
If I understand your flow correctly, you might want to put a cap on how many redirects you'll tolerateCalender
@Calender in some cases, the redirect points to the same previous url, so, to prevent the trap of an infinite loop, i propagate the previous url within the recusive call and if i end up with red_url == previous_url, i stop and return that url. Otherwise, in a normal case, at some iteration, the response.status will not be equal anymore to a redirection status, so, we return the retrieved url.Pinpoint
@AmirKrifa does that handle link.foo which points to link.bar which points back to link.foo? (I don't know httplib to know if there's an option to follow redirects, in which case, this sort of link would throw an exception before you called the recursive call)Calender
1

You can use geturl()

from urllib.request import urlopen
url = "http://bit.ly/silly"  # urlopen needs an explicit scheme
unshortened_url = urlopen(url).geturl()
print(unshortened_url)
# e.g. https://www.google.com/
Dragon answered 17/6, 2020 at 7:23 Comment(0)
0

This is a very easy task; you only need four lines of code:

import requests
url = input('Enter url : ')
site = requests.get(url)
print(site.url)

Just run this code and it will unshorten the URL.

Betthezel answered 3/9, 2021 at 17:14 Comment(1)
It's the same as this answer: https://mcmap.net/q/554101/-how-can-i-unshorten-a-urlDeviate

© 2022 - 2024 — McMap. All rights reserved.