Detect the destination of a shortened, or "tiny", URL
I have just scraped a bunch of Google Buzz data, and I want to know which Buzz posts reference the same news articles. The problem is that many of the links in these posts have been modified by URL shorteners, so it could be the case that many distinct shortened URLs actually all point to the same news article.

Given that I have millions of posts, what is the most efficient way (preferably in python) for me to

  1. Detect whether a URL is a shortened URL (from any of the many URL-shortening services, or at least the largest ones).
  2. Find the "destination" of the shortened URL, i.e., the long, original version.

Does anyone know whether the URL shorteners impose strict request rate limits? If I keep this down to 100/second (all coming from the same IP address), do you think I'll run into trouble?

UPDATE & PRELIMINARY SOLUTION The responses have led me to the following simple solution

import urllib2
response = urllib2.urlopen("http://bit.ly/AoifeMcL_ID3") # Some shortened url
url_destination = response.url

That's it!
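Since many distinct short links may resolve to the same article, it is worth caching lookups so each unique short URL is fetched only once across the millions of posts. A minimal sketch using Python 3's urllib.request (the resolver function is injectable here, which is an illustrative choice so the caching logic can be exercised without network access):

```python
import urllib.request


def make_cached_resolver(resolver=None):
    """Return a resolve(url) function that remembers past lookups."""
    cache = {}
    if resolver is None:
        # Default resolver: follow redirects and report the final URL
        # (this hits the network).
        resolver = lambda u: urllib.request.urlopen(u).url

    def resolve(url):
        if url not in cache:
            cache[url] = resolver(url)
        return cache[url]

    return resolve
```

With this wrapper, repeated occurrences of the same shortened link cost one request instead of one per post.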

Macmullin answered 16/3, 2010 at 12:11 Comment(0)

The easiest way to get the destination of a shortened URL is with urllib. If the short URL is valid (response code 200), the final URL will be returned to you.

>>> import urllib
>>> resp = urllib.urlopen('http://bit.ly/bcFOko')
>>> resp.getcode()
200
>>> resp.url
'http://mrdoob.com/lab/javascript/harmony/'

And that's that!

Titanism answered 16/3, 2010 at 12:37 Comment(0)

(AFAIK) Most URL shorteners keep track of URLs already shortened, so several requests to the same service with the same URL will return the same short code.

As has been suggested, the best way to extract the real URL is to read the headers from a response to a request for the shortened URL. However, some shortening services (e.g. bit.ly) also provide an API method that returns the long URL.
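Reading the headers directly means stopping at the first redirect rather than following the whole chain. A sketch using http.client, which does not follow redirects on its own (this assumes a plain-http short link; an https one would need http.client.HTTPSConnection, and a Location header may be relative, so it is joined back against the short URL):

```python
import http.client
from urllib.parse import urlsplit, urljoin


def first_redirect(short_url):
    """Return the Location target of the first redirect, or None."""
    parts = urlsplit(short_url)
    conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        # HEAD avoids downloading a body; query strings are ignored
        # here for brevity.
        conn.request("HEAD", parts.path or "/")
        resp = conn.getresponse()
        location = resp.getheader("Location")
    finally:
        conn.close()
    if location is None:
        return None
    # The Location header may be relative; resolve it against the short URL.
    return urljoin(short_url, location)
```

Following only one hop is cheaper than fetching the final page, but note that some shorteners chain through more than one redirect.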

Grecize answered 16/3, 2010 at 12:19 Comment(0)
  1. Build a list of the most-used URL shorteners, expanding it as you discover new ones, then check each link's domain against the list.

  2. You do not know where a URL points until you follow it, so the best way to do this is to follow the shortened URL and extract the HTTP headers of the response to see where it redirects.

I guess that with 100 requests per second you could well run into trouble (the worst that could happen is that they blacklist your IP as a spammer).
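To stay under a chosen rate, the pacing can be handled client-side. A sketch in which the pause calculation is a pure function (so the logic can be checked without actually sleeping); the 100/second figure is just the number from the question, not a documented limit of any service:

```python
import time
import urllib.request


def wait_needed(last, now, per_second):
    """Seconds to pause so requests stay under per_second."""
    if last is None:
        return 0.0
    return max(0.0, 1.0 / per_second - (now - last))


def resolve_all(urls, per_second=100):
    """Resolve each short URL in turn, pausing to respect the rate limit."""
    last = None
    results = {}
    for url in urls:
        pause = wait_needed(last, time.monotonic(), per_second)
        if pause > 0:
            time.sleep(pause)
        last = time.monotonic()
        try:
            results[url] = urllib.request.urlopen(url).url
        except Exception:
            results[url] = None  # dead, malformed, or rate-limited link
    return results
```

Catching the per-URL failure matters at this scale: one dead link should not abort a crawl over millions of posts.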

Scum answered 16/3, 2010 at 12:15 Comment(1)
Do you know what Python library and command I could use to most efficiently discover the destination URL? For example: import urllib2; response = urllib2.urlopen("bit.ly/AoifeMcL_ID3"); headers = response.headers.headers. In this case the headers contain the domain name of the destination URL, but I don't see the complete URL. Where in the response do I need to look for the destination URL?Macmullin

The posted solution only works for Python 2.x; for Python 3.x you can do this

import urllib.request as urlreq
link = urlreq.urlopen("http://www.google.com")
fullURL = link.url

to get the full URL.

Luxe answered 8/7, 2016 at 4:24 Comment(0)

From what I have read, these answers address the second question. I was interested in the first. After reviewing a list of about 300 shorteners, it seems the best way to detect them is simply to put them into a list or regex and look for a match with any of them.

"|".join(z1)
'0rz.tw|1link.in|1url.com|2.gp|2big.at    
r1 = re.compile("|".join(z1),flags=ic)

Then use r1 to match, as a regex, against whatever you are trying to find the URL shorteners in (mail, etc.).

A very good list is here: longurl.org/services
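An alternative to one giant alternation regex is to parse out each link's hostname and test set membership, which sidesteps escaping entirely and stays fast for any list size. A sketch assuming a small sample of domains from such a list (in practice you would load the full list from a file):

```python
from urllib.parse import urlsplit

# Small illustrative sample; load the real list from a file in practice.
SHORTENER_DOMAINS = {"bit.ly", "0rz.tw", "1link.in", "tinyurl.com", "t.co"}


def is_shortened(url):
    """True if the URL's host is a known shortening service."""
    host = urlsplit(url).hostname or ""
    return host.lower() in SHORTENER_DOMAINS
```

Set lookup is O(1) per link regardless of how many shorteners are tracked, whereas the regex grows with the list.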

Hemingway answered 8/5, 2014 at 17:52 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.