How can I detect that a site redirects to another page when checking it with the requests module?

4

19

For example, suppose I go to www.yahoo.com/thispage, and Yahoo has set up a filter to redirect /thispage to /thatpage, so whenever someone goes to /thispage, they land on /thatpage.

If I use httplib/requests/urllib, will it know that there was a redirection? What about error pages? Some sites redirect the user to /errorpage whenever the page cannot be found.

Jackson answered 20/11, 2012 at 21:47 Comment(4)
What is the problem you are trying to solve? How is your code not doing the right thing? If you merely want to know about error modes, test this behaviour yourself. - Barrettbarrette
Check #554946 - Walking
@Barrettbarrette I have a huge list (1k+) of URLs to test whether they are up or not. I randomly chose 40-50 of them to test manually, and I see that some get redirected to an error page whenever a page cannot be found. I also see many URLs being redirected because the URL pattern has changed: same names, just written differently. - Jackson
@Walking That sort of looks like what I need, I'll check it out. Thanks! - Jackson
29

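With requests, any redirects that were followed are listed in the .history attribute of the response object. It is a Python list of Response objects, ordered from the oldest request to the most recent, and it is empty if no redirect occurred. See the documentation for more.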
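
As a minimal sketch of how .history might be inspected: the helper names redirect_chain and was_redirected below are hypothetical, not part of requests, and they only rely on the .history, .status_code and .url attributes of a response.

```python
def redirect_chain(response):
    """Return [(status_code, url), ...] for every hop, ending with the final response."""
    hops = [(hop.status_code, hop.url) for hop in response.history]
    hops.append((response.status_code, response.url))
    return hops


def was_redirected(response):
    """True if at least one HTTP redirect was followed to reach this response."""
    return bool(response.history)


# Usage (needs network access and the third-party requests package):
#   import requests
#   r = requests.get('http://www.yahoo.com/thispage', timeout=10)
#   if was_redirected(r):
#       print(redirect_chain(r))
```

Comparing the first and last URL in the chain is one way to tell a renamed page from an unchanged one.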

Celestinecelestite answered 20/11, 2012 at 22:3 Comment(0)
19

To prevent requests from following redirects use:

r = requests.get('http://www.yahoo.com/thispage', allow_redirects=False)

If it is indeed a redirect, you can read the redirect target from r.headers['location'].
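
A rough sketch of that check follows. The helper redirect_target and the REDIRECT_CODES set are illustrative names, not part of requests. The Location header may be relative, so the sketch resolves it against the requested URL.

```python
from urllib.parse import urljoin

# Status codes treated as redirects in this sketch.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def redirect_target(response):
    """Return the absolute redirect target, or None if this is not a redirect."""
    if response.status_code not in REDIRECT_CODES:
        return None
    location = response.headers.get('location')
    if location is None:
        return None
    # Resolve a relative Location such as '/thatpage' against the request URL.
    return urljoin(response.url, location)


# Usage (needs network access and the third-party requests package):
#   import requests
#   r = requests.get('http://www.yahoo.com/thispage', allow_redirects=False, timeout=10)
#   print(redirect_target(r))
```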

Freund answered 20/11, 2012 at 22:6 Comment(0)
3

The accepted answer is the correct first option, but in some cases, if the site redirects with a meta tag, the page you end up on also specifies a canonical link. In this example I request http://en.wikipedia.org/wiki/Google_Inc_Class_A from Wikipedia, which is a URL that redirects.

>>> import requests
>>> request = requests.get('http://en.wikipedia.org/wiki/Google_Inc_Class_A')

Checking the history shows nothing:

>>> request.history
[]

An alternative is to pull the canonical URL, which should hopefully hold the address you have been redirected to. (Note that I'm using BeautifulSoup here as well.)

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(request.content, 'html.parser')
>>> canonical = soup.find('link', {'rel': 'canonical'})
>>> canonical['href']
'http://en.wikipedia.org/wiki/Google'

This does match the URL you get redirected to in this particular case. To be clear, this is an ugly second option, but it is worth trying if all else fails.
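
If BeautifulSoup is not available, the same canonical-link lookup could be sketched with only the standard library. This swaps in html.parser for BeautifulSoup, and the names CanonicalLinkParser and canonical_url are hypothetical.

```python
from html.parser import HTMLParser


class CanonicalLinkParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag it sees."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != 'link' or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel may hold several space-separated values.
        rel_values = (attr.get('rel') or '').lower().split()
        if 'canonical' in rel_values:
            self.canonical = attr.get('href')


def canonical_url(html):
    """Return the canonical URL declared in the HTML text, or None."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical


# Usage with a requests response (assumed to be called `response`):
#   canonical_url(response.text)
```

If the returned value differs from the URL you requested, that is a hint (not proof) that the page was served under a different address.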

Richella answered 25/11, 2014 at 4:44 Comment(1)
For future readers: I just checked this example and the history is correctly populated: requests.get('http://en.wikipedia.org/wiki/Google_Inc_Class_A', allow_redirects=True). I don't know if it's due to the "allow_redirects" parameter or to a newer version of the requests package. - Ilo
2

It depends on how they are doing the redirection. The "right" way is to return a redirect HTTP status code (301/302/303). The "wrong" way is to place a refresh meta tag in the HTML.

If they do the former, requests will follow it transparently. If they do the latter, requests will not follow it, and you have to look for the tag in the HTML yourself. Note that any sane error page redirect will still have an error status code (e.g. 404), which you can check as response.status_code.
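
To illustrate both cases together, here is a small sketch that reports the status code, whether an HTTP redirect was followed, and whether the page contains a refresh meta tag. The names MetaRefreshParser, meta_refresh_target and summarize are made up for this example, and the parsing of the content attribute is deliberately simplistic.

```python
from html.parser import HTMLParser


class MetaRefreshParser(HTMLParser):
    """Records the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != 'meta' or self.target is not None:
            return
        attr = dict(attrs)
        if (attr.get('http-equiv') or '').lower() != 'refresh':
            return
        # The content attribute looks like "0; url=/thatpage".
        _, _, rest = (attr.get('content') or '').partition(';')
        rest = rest.strip()
        if rest.lower().startswith('url='):
            self.target = rest[4:].strip().strip('\'"')


def meta_refresh_target(html):
    """Return the URL named in a refresh meta tag, or None."""
    parser = MetaRefreshParser()
    parser.feed(html)
    return parser.target


def summarize(response):
    """Collect the redirect and error signals for one response."""
    return {
        'status': response.status_code,
        'http_redirect': bool(response.history),
        'meta_refresh': meta_refresh_target(response.text),
        'error': response.status_code >= 400,
    }
```

Calling summarize on the result of requests.get(url) for each URL in a list gives one record per URL that can be filtered for errors or redirects.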

Lavish answered 20/11, 2012 at 22:5 Comment(0)
