Scrapy: how to debug scrapy lost requests

I have a Scrapy spider, but sometimes some of the requests it yields never return a response.

I found this out by adding log messages just before yielding a request and right after receiving its response.

The spider iterates over pages and, on each page, parses the links to the items to be scraped.

Here is part of the code:

from scrapy import log
from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.spider import BaseSpider


class SampleSpider(BaseSpider):
    ....
    def parse_page(self, response):
        ...
        request = Request(target_link, callback=self.parse_item_general)
        request.meta['date_updated'] = date_updated
        self.log('parse_item_general_send {url}'.format(url=request.url), level=log.INFO)
        yield request

    def parse_item_general(self, response):
        self.log('parse_item_general_recv {url}'.format(url=response.url), level=log.INFO)
        sel = Selector(response)
        ...

I've compared the counts of each log message, and there are more "parse_item_general_send" entries than "parse_item_general_recv" entries.
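One way to do that comparison (a minimal sketch; 'spider.log' is a placeholder for wherever your LOG_FILE setting points):

# Count how many requests were sent vs. how many responses came back,
# based on the two log markers used in the spider above.
with open('spider.log') as f:
    log_text = f.read()

sent = log_text.count('parse_item_general_send')
received = log_text.count('parse_item_general_recv')
print('sent=%d received=%d lost=%d' % (sent, received, sent - received))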

There are no 400 or 500 errors in the final statistics; every response has status code 200. It looks like the requests just disappear.

I've also added these parameters to minimize possible errors:

CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 0.8

Because of the asynchronous nature of Twisted, I don't know how to debug this. I've found a similar question: Python Scrapy not always downloading data from website, but it has no answers.

Seclusion answered 21/12, 2013 at 20:46 Comment(5)
Try disabling the offsite middleware to see what happens.Piperpiperaceous
I've tried (based on this example); nothing has changed. Some requests still disappear: from 2 to 5 out of about 120 requests always go missing.Seclusion
Could you provide a minimal example that reproduces this issue? Otherwise it will be hard to point out what's wrong, as this is not a common issue.Piperpiperaceous
Alternatively, try adding dont_filter=True to your Request objects (see the sketch after these comments). Usually duplicate requests are filtered out without prior notice. It might happen that your requests get redirected to an already visited URL and thus get filtered.Piperpiperaceous
I've tried to create a short demo script and it works without errors. So, as expected, the error is somewhere in the spider code. Probably I'm using yield incorrectly inside conditionals. I will update the question when I find the root cause.Seclusion
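For reference, a minimal sketch of the dont_filter suggestion applied to the request built in parse_page above (dont_filter is a standard Request argument; everything else is taken from the question's code):

# Bypass the duplicate filter for this particular request so it is never
# silently dropped as "already seen".
request = Request(target_link,
                  callback=self.parse_item_general,
                  dont_filter=True)
request.meta['date_updated'] = date_updated
yield request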

On the same note as Rho, you can add the setting

DUPEFILTER_CLASS = 'scrapy.dupefilter.BaseDupeFilter' 

to your "settings.py" which will remove the url caching. This is a tricky issue since there isn't a debug string in the scrapy logs that tells you when it uses a cached result.

Alcoholize answered 29/1, 2014 at 20:1 Comment(1)
I was having the same issue. Somehow, I was always losing 30 requests, and always the same requests. After setting this option in my settings.py file, everything worked just fine.Sunrise
