Scrapy get request url in parse

5

58

How can I get the request URL in Scrapy's parse() function? I have a lot of URLs in start_urls, and some of them redirect my spider to the homepage, so I end up with an empty item. I need something like item['start_url'] = request.url to store these URLs. I'm using BaseSpider.

Waterer answered 19/11, 2013 at 20:7 Comment(2)

did this method work? – Dickenson 20/11, 2013 at 22:33

instead of storing them aside, during scraping you can access requested_url, check below my answer – Dendriform 13/12, 2017 at 12:19

111

The 'response' variable that's passed to parse() has the info you want. You shouldn't need to override anything.

eg. (EDITED)

def parse(self, response):
    print("URL: " + response.request.url)

Hafnium answered 25/1, 2015 at 7:50 Comment(3)

But that is not the request url, but the response url. Scrapy's middleware handles redirections, therefore you can obtain a different url. – Yearning 20/1, 2016 at 13:9

response.request.url – Fr 27/7, 2016 at 15:30

If the URL is redirected, this gives the redirected URL, not the one originally provided – Dendriform 13/12, 2017 at 9:42

18

The request object is accessible from the response object, therefore you can do the following:

def parse(self, response):
    item = {}  # or an instance of your own Item class
    item['start_url'] = response.request.url

Yearning answered 29/12, 2015 at 3:57 Comment(0)

11

You don't need to store the requested URLs somewhere yourself. Keep in mind, too, that Scrapy does not process URLs in the same order as they are listed in start_urls.

Use

response.request.meta['redirect_urls']

which gives you the list of URLs the request went through while being redirected, like ['http://requested_url','https://redirected_url']. The key is only set when a redirect actually happened, and the final URL is available as response.url.

To access the first URL in that list, use

response.request.meta['redirect_urls'][0]

For more, see the Scrapy documentation (doc.scrapy.org), which says:

RedirectMiddleware

This middleware handles redirection of requests based on response status.

The urls which the request goes through (while being redirected) can be found in the redirect_urls Request.meta key.

Hope this helps you

Dendriform answered 13/12, 2017 at 12:17 Comment(2)

I believe all you need is: redirect_urls = response.meta.get("redirect_urls") – Dorian 29/10, 2020 at 10:44

This should be the accepted answer. – Sharlenesharline 17/11, 2021 at 10:25
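Putting the answer and the comment above together, here is a minimal sketch of a helper you could call from parse(). The name original_url is made up for illustration, and the SimpleNamespace objects are only stand-ins for a Scrapy response so the snippet runs on its own; the one thing it relies on is that the response exposes .url and .meta.

```python
from types import SimpleNamespace


def original_url(response):
    """Return the URL that was first requested, before any redirects."""
    # 'redirect_urls' is only present in meta when a redirect happened,
    # so use .get() instead of indexing to avoid a KeyError.
    redirect_urls = response.meta.get("redirect_urls")
    if redirect_urls:
        return redirect_urls[0]
    # No redirect: the response URL is the requested URL.
    return response.url


# Stand-ins for Scrapy responses (assumption: only .url and .meta are needed).
redirected = SimpleNamespace(
    url="https://example.com/",
    meta={"redirect_urls": ["http://example.com/old"]},
)
direct = SimpleNamespace(url="https://example.com/page", meta={})

print(original_url(redirected))
print(original_url(direct))
```

Inside a spider this would be used as item['start_url'] = original_url(response).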

7

You need to override BaseSpider's make_requests_from_url(url) method to assign the start URL to the item, and then use the Request.meta special keys to pass that item to the parse function. (Later Scrapy versions deprecate make_requests_from_url in favour of start_requests.)

from scrapy.http import Request

    # override method (goes inside your spider class; MyItem is your own Item subclass)
    def make_requests_from_url(self, url):
        item = MyItem()

        # assign url
        item['start_url'] = url
        request = Request(url, dont_filter=True)

        # set the meta['item'] to use the item in the next call back
        request.meta['item'] = item
        return request


    def parse(self, response):

        # access and do something with the item in parse
        item = response.meta['item']
        item['other_url'] = response.url
        return item

Hope that helps.

Dickenson answered 19/11, 2013 at 22:6 Comment(0)

3

Python 3.5

Scrapy 1.5.0

from scrapy.http import Request

# override method
def start_requests(self):
    for url in self.start_urls:
        item = {'start_url': url}
        request = Request(url, dont_filter=True)
        # set the meta['item'] to use the item in the next call back
        request.meta['item'] = item
        yield request

# use meta variable
def parse(self, response):
    url = response.meta['item']['start_url']

Umbrage answered 17/4, 2018 at 8:7 Comment(0)
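As a rough sketch of what parse() could then do with the item carried in meta, the helper below copies it, records the final URL, and flags start URLs that ended up somewhere else (the homepage case from the question). The function name item_from_response and the 'final_url' and 'redirected' keys are hypothetical, and the SimpleNamespace object only stands in for a Scrapy response.

```python
from types import SimpleNamespace


def item_from_response(response):
    """Copy the item carried in meta and record where the request ended up."""
    # Copy so the dict stored on the request is left untouched.
    item = dict(response.meta.get("item", {}))
    item["final_url"] = response.url
    # Plain string comparison; assumes both URLs are written the same way
    # (no normalisation of trailing slashes, case, etc.).
    item["redirected"] = item.get("start_url") != response.url
    return item


# Stand-in for a response whose start URL was redirected to the homepage.
fake_response = SimpleNamespace(
    url="https://example.com/",
    meta={"item": {"start_url": "https://example.com/some-page"}},
)

print(item_from_response(fake_response))
```

In the spider, parse() could simply yield item_from_response(response), and items with redirected set to True can be filtered out or inspected later.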
