How do I prevent Python's urllib(2) from following a redirect

4

47

I am trying to log into a site using Python; however, the site sends a cookie and a redirect in the same response. Python follows the redirect, which prevents me from reading the cookie sent by the login page. How do I prevent Python's urllib (or urllib2) urlopen from following the redirect?

Irreconcilable answered 16/2, 2009 at 20:29 Comment(3)
Duplicate: #110998 - Rimskykorsakov
A similar question: #9891315 - Janssen
For readers who don't need urllib specifically: requests supports this "out of the box", see #110998 - Launder
33

You could do a couple of things:

  1. Build your own HTTPRedirectHandler that intercepts each redirect
  2. Create an instance of HTTPCookieProcessor and install that opener so that you have access to the cookiejar.

Here is a quick example that shows both:

import urllib2

#redirect_handler = urllib2.HTTPRedirectHandler()

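class MyHTTPRedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_302(self, req, fp, code, msg, headers):
        # headers belongs to the redirect response, so any Set-Cookie
        # header sent along with the redirect can be inspected here.
        print "Cookie Manip Right Here"
        # Delegate to the stock handler, which follows the redirect.
        return urllib2.HTTPRedirectHandler.http_error_302(self, req, fp, code, msg, headers)

    # Handle the other redirect status codes the same way.
    http_error_301 = http_error_303 = http_error_307 = http_error_302

# The processor creates its own CookieJar when none is passed in.
cookieprocessor = urllib2.HTTPCookieProcessor()

opener = urllib2.build_opener(MyHTTPRedirectHandler, cookieprocessor)
urllib2.install_opener(opener)

response = urllib2.urlopen("WHEREEVER")
print response.read()

print cookieprocessor.cookiejar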
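
For readers on Python 3, where urllib2 became urllib.request, a related approach is to subclass HTTPRedirectHandler and have redirect_request() return None, so the redirect is not followed and the 3xx reply surfaces as an HTTPError. This is only a sketch under that assumption: NoFollowRedirectHandler, build_no_redirect_opener, fetch_without_redirect and the extra_handlers parameter are hypothetical names, not part of the original answer.

```python
import urllib.error
import urllib.request
from http.cookiejar import CookieJar


class NoFollowRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Hypothetical handler (not from the answer) that never follows redirects."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means no follow-up request is built, so urllib
        # ends up raising the 3xx reply as an HTTPError.
        return None


def build_no_redirect_opener(*extra_handlers):
    """Return (opener, cookie_jar); extra_handlers is an illustrative hook."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        NoFollowRedirectHandler,
        urllib.request.HTTPCookieProcessor(jar),
        *extra_handlers
    )
    return opener, jar


def fetch_without_redirect(opener, url, data=None):
    """Open url and return the reply, treating an HTTPError as a response.

    Note that this also returns genuine 4xx/5xx errors instead of raising.
    """
    try:
        return opener.open(url, data)
    except urllib.error.HTTPError as err:
        # HTTPError carries the .code and .headers of the reply.
        return err
```

With an opener from build_no_redirect_opener(), fetch_without_redirect(opener, login_url) should return a reply whose code is the 3xx status. Cookies set by that reply should already be in the jar, because the cookie processor sees the response before the error is raised.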
Morell answered 16/2, 2009 at 21:13 Comment(4)
You don't seem to be using redirect_handler = urllib2.HTTPRedirectHandler() in the example at all. Were you going to show a second example? - Skirl
You are correct, I'm not using the redirect_handler. Instead, I created my own redirect handler. I will edit to remove. - Morell
Why is it that you do not need to instantiate MyHTTPRedirectHandler, but rather pass the class into the build_opener() method? - Habit
From the documentation: "handlers can be either instances of BaseHandler, or subclasses of BaseHandler (in which case it must be possible to call the constructor without any parameters)". Since the constructor of MyHTTPRedirectHandler takes no arguments, I can pass the class in as is. - Morell
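
The class-versus-instance point can be checked directly. The sketch below uses Python 3's urllib.request (the successor of urllib2); PlainRedirectHandler, TaggedRedirectHandler and find_handler are hypothetical names for illustration, and the handlers attribute of the opener is an implementation detail read here only for inspection.

```python
import urllib.request


class PlainRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Hypothetical handler with a no-argument constructor."""


class TaggedRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Hypothetical handler whose constructor needs an argument."""

    def __init__(self, tag):
        self.tag = tag


def find_handler(opener, cls):
    """Return the first handler of type cls registered on opener, or None."""
    # OpenerDirector.handlers is an implementation detail, read for inspection.
    return next((h for h in opener.handlers if isinstance(h, cls)), None)


# A class is fine when build_opener() can call it with no arguments.
from_class = urllib.request.build_opener(PlainRedirectHandler)
# Otherwise construct the instance yourself and pass that.
from_instance = urllib.request.build_opener(TaggedRedirectHandler("login"))
```

In both cases the opener ends up holding a handler instance; passing the class only saves the explicit constructor call.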
30

If all you need is to stop redirection, there is a simple way to do it. For example, I only want to get the cookies, and for better performance I don't want to be redirected to any other page. I also want the response code to stay 3xx. Let's use 302 as an example.

import cookielib
import urllib2

class MyHTTPErrorProcessor(urllib2.HTTPErrorProcessor):

    def http_response(self, request, response):
        code, msg, hdrs = response.code, response.msg, response.info()

        # Only this check is new: hand a 302 back to the caller untouched
        # instead of passing it on to the redirect handler.
        if code == 302:
            return response

        if not (200 <= code < 300):
            response = self.parent.error(
                'http', request, response, code, msg, hdrs)
        return response

    https_response = http_response

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj), MyHTTPErrorProcessor)

This way, you don't even need to go into urllib2.HTTPRedirectHandler.http_error_302().

An even more common case is simply wanting to stop redirection entirely, as the question asks:

class NoRedirection(urllib2.HTTPErrorProcessor):

    def http_response(self, request, response):
        return response

    https_response = http_response

It is normally used this way:

import cookielib
import urllib
import urllib2

cj = cookielib.CookieJar()
opener = urllib2.build_opener(NoRedirection, urllib2.HTTPCookieProcessor(cj))
data = {}
response = opener.open('http://www.example.com', urllib.urlencode(data))
if response.code == 302:
    redirection_target = response.headers['Location']
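
On Python 3 the same idea should carry over with urllib.request in place of urllib2 and http.cookiejar in place of cookielib. The sketch below assumes that mapping; make_opener, redirect_target and the extra_handlers parameter are hypothetical helpers added for illustration. Because NoRedirection replaces the default HTTPErrorProcessor, 4xx and 5xx replies are also returned as ordinary responses instead of raising HTTPError.

```python
import urllib.request
from http.cookiejar import CookieJar


class NoRedirection(urllib.request.HTTPErrorProcessor):
    """Pass every response through untouched, so 3xx replies are not followed."""

    def http_response(self, request, response):
        return response

    https_response = http_response


def make_opener(*extra_handlers):
    """Hypothetical helper: return (opener, cookie_jar) that never redirects."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        NoRedirection,
        urllib.request.HTTPCookieProcessor(jar),
        *extra_handlers
    )
    return opener, jar


def redirect_target(response):
    """Return the Location header of a 3xx response, or None otherwise."""
    if 300 <= response.code < 400:
        return response.headers.get("Location")
    return None
```

A call such as opener.open(url, urllib.parse.urlencode(data).encode()) would then return the redirect response itself, and redirect_target(response) reads where it points. Callers have to check response.code themselves, since no error is raised for failed requests.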
Fluidize answered 31/7, 2012 at 16:33 Comment(9)
Just what I needed, and a very concise class NoRedirection(). You don't even have to store code, msg, hdrs. Thanks Alan. - Lapidate
You are right! I removed the line as you suggested. Thanks Xtof. - Fluidize
Is it possible to use this approach to get hold of the actual redirect URL? - Brunner
@Malvin9000 If you want the target of the redirection, then yes: just read response.headers['Location'] and you will get it :) - Fluidize
@Malvin9000 Not literally using read; you can assign it to a new variable or print it directly. Let me update the answer so you can see. - Fluidize
@AlanDuan Thanks a lot for the update, much appreciated. When I print redirection_target I see the URL I'm passing to opener.open() instead of the new URL that appears in my browser when I cut-and-paste the original URL. Not sure what I'm doing wrong... - Brunner
@Malvin9000 Most probably it redirects to itself. This happens when the URL supports both GET and POST: when you POST data that is not accepted, it redirects back to itself using GET. To see exactly what happens, you can use the developer tools in Chrome or Firefox to trace every step (open them with CTRL+SHIFT+I in Chrome, then select the Network tab). - Fluidize
@AlanDuan This post is pretty much exactly what I'm trying to accomplish, same HTTP header data, etc., trying to get that value of location, but maybe it's not possible using raw requests. - Brunner
Let us continue this discussion in chat. - Fluidize
12

urllib2.urlopen calls build_opener() which uses this list of handler classes:

handlers = [ProxyHandler, UnknownHandler, HTTPHandler,
HTTPDefaultErrorHandler, HTTPRedirectHandler,
FTPHandler, FileHandler, HTTPErrorProcessor]

You could try calling urllib2.build_opener(handlers) yourself with a list that omits HTTPRedirectHandler, then call the open() method on the result to open your URL. If you really dislike redirects, you could even call urllib2.install_opener(opener) to install your own non-redirecting opener globally.

It sounds like your real problem is that urllib2 isn't doing cookies the way you'd like. See also "How to use Python to login to a webpage and retrieve cookies for later usage?"
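
One caveat on the suggestion to omit HTTPRedirectHandler: build_opener() takes handlers as separate positional arguments, not a list, and its documentation says the default handlers are added unless a subclass or instance of them is passed. The Python 3 sketch below (urllib.request instead of urllib2) inspects the opener to check this. handler_names and PassiveRedirectHandler are hypothetical names, and the handlers attribute it reads is an implementation detail of OpenerDirector.

```python
import urllib.request


def handler_names(opener):
    """List the class names of the handlers registered on an opener."""
    # OpenerDirector.handlers is an implementation detail, used here
    # only for inspection.
    return [type(handler).__name__ for handler in opener.handlers]


class PassiveRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Hypothetical subclass that declines to follow any redirect."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means no follow-up request is made.
        return None


# Passing only some handlers does not drop the default redirect handler...
partial = urllib.request.build_opener(urllib.request.HTTPHandler)
# ...but passing a subclass of it replaces the default one.
replaced = urllib.request.build_opener(PassiveRedirectHandler)

default_still_present = "HTTPRedirectHandler" in handler_names(partial)
default_replaced = "HTTPRedirectHandler" not in handler_names(replaced)
```

So the practical way to change redirect behaviour is to pass a subclass of HTTPRedirectHandler, as the other answers do, instead of leaving it out.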

Infanta answered 16/2, 2009 at 20:38 Comment(1)
Regarding "You could try calling urllib2.build_opener(handlers) yourself with a list that omits HTTPRedirectHandler, then call the open() method on the result to open your URL": the docs for urllib2.build_opener() say "Instances of the following classes will be in front of the handlers, unless the handlers contain them, instances of them or subclasses of them: ProxyHandler, UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, HTTPErrorProcessor." It looks like omitting HTTPRedirectHandler won't work... - Wife
4

This question was asked before here.

EDIT: If you have to deal with quirky web applications you should probably try out mechanize. It's a great library that simulates a web browser. You can control redirecting, cookies, page refreshes... If the website doesn't rely [heavily] on JavaScript, you'll get along very nicely with mechanize.

Nieshanieto answered 16/2, 2009 at 20:46 Comment(0)
