how to handle 302 redirect in scrapy

I am receiving a 302 response from a server while scraping a website:

2014-04-01 21:31:51+0200 [ahrefs-h] DEBUG: Redirecting (302) to <GET http://www.domain.com/Site_Abuse/DeadEnd.htm> from <GET http://domain.com/wps/showmodel.asp?Type=15&make=damc&a=664&b=51&c=0>

I want to send requests to the GET URLs instead of being redirected. I found this middleware:

https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/downloadermiddleware/redirect.py#L31

I added this redirect code to my middleware.py file, and I added this to settings.py:

DOWNLOADER_MIDDLEWARES = {
    'street.middlewares.RandomUserAgentMiddleware': 400,
    'street.middlewares.RedirectMiddleware': 100,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}

But I am still getting redirected. Is that all I have to do to get this middleware working? Am I missing something?

Deckard answered 1/4, 2014 at 19:42 Comment(4)
They are probably redirecting you endlessly to prevent you from scraping the site. At least, that's what the URL makes me believe. – Schriever
Yeah, that's obviously their intent, and the reason why I posted this question. It's not an endless loop, it's simply a 302 redirect; the original URL is still received as a GET: from <GET domain.com/wps/…> and that is the URL I want to send my request to. As far as I can tell that is possible, and I found a script for it, but for some reason my settings are not working. – Deckard
I didn't mean it's an endless loop. I meant that every time you make a request, you are redirected, so they refuse to give you the content. – Schriever
The response will contain both URLs, the 302 one and the correct one; you just need to drop the 302 one and take the other, which is exactly the one you want. See en.wikipedia.org/wiki/HTTP_302 for more info about 302 responses. – Deckard

Forget about middlewares in this scenario; this will do the trick:

meta = {'dont_redirect': True, 'handle_httpstatus_list': [302]}

That said, you will need to include the meta parameter when you yield your request:

yield Request(
    item['link'],
    meta={
        'dont_redirect': True,
        'handle_httpstatus_list': [302],
    },
    callback=self.your_callback,
)
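
With dont_redirect set, the 302 response is delivered to your callback instead of being followed, so you can read the redirect target yourself. A minimal sketch of doing that from the Location header (the spider name, callback name, and URL below are made up for illustration):

from scrapy import Request, Spider

class LinkSpider(Spider):
    name = 'link_spider'  # hypothetical name

    def start_requests(self):
        yield Request(
            'http://domain.com/wps/showmodel.asp',  # placeholder URL
            meta={'dont_redirect': True, 'handle_httpstatus_list': [302]},
            callback=self.parse_link,
        )

    def parse_link(self, response):
        # Because of dont_redirect, a 302 lands here instead of being followed.
        if response.status == 302:
            # The target the server wanted to send us to is in the Location header.
            location = response.headers.get('Location', b'').decode()
            self.logger.info('Got 302 from %s to %s', response.url, location)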
Deckard answered 14/4, 2014 at 21:6 Comment(3)
Didn't work for me with the current Scrapy version. I tried other codes in handle_httpstatus_list, like 404, and they work fine; it just doesn't work with 301 and 302. Any ideas? – Rita
@mrki How to handle redirection manually for the start URL, i.e. if start_urls is redirected somewhere else? – Jakob
'handle_httpstatus_list': [302] works in scrapy==1.4.0 – Xuthus

An inexplicable 302 response, such as a redirect from a page that loads fine in a web browser to the home page or some fixed page, usually indicates a server-side measure against undesired activity.

You must either reduce your crawl rate, or use a smart proxy (e.g. Crawlera) or a proxy-rotation service, and retry your requests when you get such a response.
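
For the crawl-rate option, a settings.py sketch (the numbers are illustrative, not recommendations):

# settings.py
DOWNLOAD_DELAY = 5           # pause between requests to the same domain, in seconds
AUTOTHROTTLE_ENABLED = True  # let Scrapy slow down further based on server latency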

To retry such a response, add 'handle_httpstatus_list': [302] to the meta of the source request, and check if response.status == 302 in the callback. If it is, retry your request by yielding response.request.replace(dont_filter=True).

When retrying, you should also limit the maximum number of retries for any given URL. You could keep a dictionary to track retries:

from scrapy import Request, Spider


class MySpider(Spider):
    name = 'my_spider'

    max_retries = 2

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.retries = {}

    def start_requests(self):
        yield Request(
            'https://example.com',
            callback=self.parse,
            meta={
                'handle_httpstatus_list': [302],
            },
        )

    def parse(self, response):
        if response.status == 302:
            retries = self.retries.setdefault(response.url, 0)
            if retries < self.max_retries:
                self.retries[response.url] += 1
                yield response.request.replace(dont_filter=True)
            else:
                self.logger.error('%s still returns 302 responses after %s retries',
                                  response.url, retries)
            return

Depending on the scenario, you might want to move this code to a downloader middleware.
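
For reference, a minimal sketch of that middleware variant. The class name, priority, and retry limit below are made up, and it still assumes 'handle_httpstatus_list': [302] is set on the requests so that the built-in RedirectMiddleware leaves those responses alone:

class Retry302Middleware:

    max_retries = 2  # made-up limit

    def __init__(self):
        self.retries = {}

    def process_response(self, request, response, spider):
        if response.status != 302:
            return response
        retries = self.retries.setdefault(request.url, 0)
        if retries < self.max_retries:
            self.retries[request.url] += 1
            # Returning a request from process_response makes Scrapy reschedule it.
            return request.replace(dont_filter=True)
        spider.logger.error('%s still returns 302 responses after %s retries',
                            request.url, retries)
        return response

Enable it with an entry such as {'myproject.middlewares.Retry302Middleware': 543} in DOWNLOADER_MIDDLEWARES (the path and priority are placeholders).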

Chastain answered 21/11, 2019 at 10:13 Comment(0)

I had an issue with an infinite redirect loop when using HTTPCACHE_ENABLED = True. I managed to avoid the problem by setting HTTPCACHE_IGNORE_HTTP_CODES = [301, 302].
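
In settings.py that looks like:

HTTPCACHE_ENABLED = True
# Don't cache redirect responses, so a stale cached 301/302 can't cause a loop.
HTTPCACHE_IGNORE_HTTP_CODES = [301, 302]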

Vallievalliere answered 26/3, 2015 at 12:54 Comment(1)
Only your solution worked in my case, after I changed settings.py to the following: HTTPCACHE_ENABLED = False and HTTPCACHE_IGNORE_HTTP_CODES = [301,302] – Terri

You can disable the RedirectMiddleware by setting REDIRECT_ENABLED to False in settings.py
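
In settings.py:

REDIRECT_ENABLED = False  # turns off RedirectMiddleware for the whole project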

Throve answered 4/2, 2019 at 18:36 Comment(0)

I figured out how to bypass the redirect as follows:

1- Check whether I am redirected in parse().

2- If redirected, simulate the action that escapes this redirection and returns you to the URL you need to scrape. You may need to inspect the network behavior in Google Chrome and simulate the POST request that gets you back to your page.

3- Move on to another callback, complete all the scraping work inside it through a recursive loop in which it calls itself, and put in a condition to break the loop at the end.

Below is the example I used to bypass a disclaimer page, get back to my main URL, and start scraping.

import scrapy
from scrapy.http import FormRequest
import requests

# from myproject.items import TerrascanItem  # import your item class (the path is yours)


class ScrapeClass(scrapy.Spider):

    name = 'terrascan'

    page_number = 0

    start_urls = [
        # Your main URL, or a list of your URLs, or read URLs from a file into a list
    ]

    def parse(self, response):

        ''' Here I killed the Disclaimer page and continued in the proc below with follow !!! '''

        # Get the currently requested URL
        current_url = response.request.url

        # Get all followed redirect URLs
        redirect_url_list = response.request.meta.get('redirect_urls')
        # Get the first URL followed by the spider
        first_redirect_url = response.request.meta.get('redirect_urls')[0]

        # Handle redirection as below (the redirection check is taken from redirect.py
        # in the \downloadermiddlewares folder)
        allowed_status = (301, 302, 303, 307, 308)
        if 'Location' in response.headers or response.status in allowed_status:  # <== this is the condition of redirection
            print(current_url, '<========= am not redirected @@@@@@@@@@')
        else:
            print(current_url, '<====== kill that please %%%%%%%%%%%%%')

            session_requests = requests.session()

            # Got all the data below from monitoring network behavior in Google Chrome
            # while simulating a click on 'I Agree'
            headers_ = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
                        'ctl00$cphContent$btnAgree': 'I Agree'}

            Post_ = session_requests.post(current_url, headers=headers_)

            print(response.url, '<========= check this please')

            return FormRequest.from_response(Post_, callback=self.parse_After_disclaimer)

    def parse_After_disclaimer(self, response):

        print(response.status)
        print(response.url)

        # Put your condition here to make sure the current URL is the one you need;
        # otherwise escape again until you kill the redirection
        if response.url not in [your list of URLs]:
            print('I am here brother')
            yield scrapy.Request(Your URL, callback=self.parse_After_disclaimer)
        else:
            # Here you are good to go with the scraping work
            items = TerrascanItem()

            all_td_tags = response.css('td')
            print(len(all_td_tags), 'all_td_results', response.url)

            parcel_No = all_td_tags.css('#ctl00_cphContent_ParcelOwnerInfo1_lbParcelNumber::text').extract()
            Owner_Name = all_td_tags.css('#ctl00_cphContent_ParcelOwnerInfo1_lbOwnerName::text').extract()

            if parcel_No:
                items['parcel_No'] = parcel_No
            else:
                items['parcel_No'] = ''

            yield items

        # Here you put the condition for the recursive call of this process
        ScrapeClass.page_number += 1
        # next_page = 'http://terrascan.whitmancounty.net/Taxsifter/Search/results.aspx?q=[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]&page=' + str(ScrapeClass.page_number) + '&1=1#rslts'
        next_page = Your URLS[ScrapeClass.page_number]
        print('am in page #', ScrapeClass.page_number, '===', next_page)
        if ScrapeClass.page_number < len(ScrapeClass.start_urls_AfterDisclaimer) - 1:  # start_urls_AfterDisclaimer is your list of post-disclaimer URLs
            yield response.follow(next_page, callback=self.parse_After_disclaimer)
Kelsey answered 10/6, 2020 at 5:16 Comment(0)

I added this redirect code to my middleware.py file and I added this into settings.py:

DOWNLOADER_MIDDLEWARES_BASE says that RedirectMiddleware is already enabled by default, so what you did didn't matter.
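
If the goal was to replace the built-in middleware with the modified copy, the built-in entry has to be disabled explicitly. A sketch using the OP's module paths (600 is the built-in's default slot):

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': None,  # disable the default
    'street.middlewares.RedirectMiddleware': 600,  # run the custom copy in its place
}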

I want to send requests to the GET URLs instead of being redirected.

How? The server responds with a 302 to your GET request. If you GET the same URL again, you will be redirected again.

What are you trying to achieve?

If you want to not be redirected, see these questions:

Mccarter answered 2/4, 2014 at 7:6 Comment(0)