I have a Scrapy spider, but sometimes it doesn't return requests.
I found this out by adding log messages just before yielding a request and right after receiving a response.
The spider iterates over a set of pages and, on each page, parses the links to the items to be scraped.
Here is part of the code:
from scrapy import log
from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.spider import BaseSpider

class SampleSpider(BaseSpider):
    ....

    def parse_page(self, response):
        ...
        # log just before the request is handed to the scheduler
        request = Request(target_link, callback=self.parse_item_general)
        request.meta['date_updated'] = date_updated
        self.log('parse_item_general_send {url}'.format(url=request.url), level=log.INFO)
        yield request

    def parse_item_general(self, response):
        # log as soon as the downloaded response reaches the callback
        self.log('parse_item_general_recv {url}'.format(url=response.url), level=log.INFO)
        sel = Selector(response)
        ...
I've compared the counts of the two log messages, and there are more "parse_item_general_send" entries than "parse_item_general_recv" ones.
There are no 400 or 500 errors in the final statistics; every response status code is 200. It looks like the requests just disappear.
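One way to see where the missing requests go is to dump the crawl stats when the spider closes. The sketch below is an assumption on my part: it relies on the spider having access to self.crawler.stats and on your Scrapy version recording the dupefilter/filtered and offsite/filtered counters (newer versions do); adjust the names if yours differ. Add a method like this to SampleSpider:

def closed(self, reason):
    # Called when the spider finishes; print the counters that track
    # requests dropped by the duplicate filter and the offsite middleware.
    stats = self.crawler.stats.get_stats()
    self.log('dupefilter/filtered: {0}'.format(stats.get('dupefilter/filtered', 0)),
             level=log.INFO)
    self.log('offsite/filtered: {0}'.format(stats.get('offsite/filtered', 0)),
             level=log.INFO)

If those counts roughly match the gap between the send and recv messages, the requests are being filtered rather than lost.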
I've also added these settings to minimize possible errors:
CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 0.8
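If the duplicate filter turns out to be the culprit, you can make it log every request it drops instead of staying silent. This is a sketch assuming a Scrapy version that supports the DUPEFILTER_DEBUG setting (older releases may not have it):

# settings.py
# Log every request dropped by the duplicate filter, not just a one-line summary.
DUPEFILTER_DEBUG = True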
Because of the asynchronous nature of Twisted, I don't know how to debug this. I've found a similar question, Python Scrapy not always downloading data from website, but it doesn't have any answers.
Try adding dont_filter=True to your Request objects. Duplicate requests are usually filtered out without any notice. It can also happen that a request gets redirected to an already-visited URL and is therefore filtered. – Piperpiperaceous
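Applied to the code above, the suggestion looks like the sketch below. Note that dont_filter=True disables duplicate filtering only for the requests that set it, so make sure the page iteration itself can't loop back over the same links forever:

request = Request(target_link,
                  callback=self.parse_item_general,
                  dont_filter=True)  # do not drop this request even if its URL was seen before
request.meta['date_updated'] = date_updated
self.log('parse_item_general_send {url}'.format(url=request.url), level=log.INFO)
yield request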