Scrapy: how to debug scrapy lost requests

I have a Scrapy spider, but sometimes some of the requests it yields never return a response.

I found this out by adding log messages just before yielding a request and right after receiving its response.

The spider iterates over pages and, on each page, parses the links to the items to be scraped.

Here is part of the code:

from scrapy import log
from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.spider import BaseSpider


class SampleSpider(BaseSpider):
    ....
    def parse_page(self, response):
        ...
        request = Request(target_link, callback=self.parse_item_general)
        request.meta['date_updated'] = date_updated
        self.log('parse_item_general_send {url}'.format(url=request.url), level=log.INFO)
        yield request

    def parse_item_general(self, response):
        self.log('parse_item_general_recv {url}'.format(url=response.url), level=log.INFO)
        sel = Selector(response)
        ...

I've compared the counts of each log message, and there are more "parse_item_general_send" entries than "parse_item_general_recv" entries.
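One way to do that comparison (a minimal sketch; 'spider.log' is a placeholder for wherever your LOG_FILE setting points):

# Count how many requests were sent vs. how many responses came back,
# based on the two log markers used in the spider above.
with open('spider.log') as f:
    log_text = f.read()

sent = log_text.count('parse_item_general_send')
received = log_text.count('parse_item_general_recv')
print('sent=%d received=%d lost=%d' % (sent, received, sent - received))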

There are no 400 or 500 errors in the final statistics; every response has status code 200. It looks like the requests just disappear.

I've also added these parameters to minimize possible errors:

CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 0.8

Because of the asynchronous nature of Twisted, I don't know how to debug this. I've found a similar question: Python Scrapy not always downloading data from website, but it has no answers.

Seclusion answered 21/12, 2013 at 20:46 Comment(5)
Try disabling the offsite middleware to see what happens.Piperpiperaceous
I've tried (based on this example); nothing has changed. Some requests still disappear: from 2 to 5 out of about 120 requests always go missing.Seclusion
Could you provide a minimal example that reproduces this issue? Otherwise it will be hard to point out what's wrong, as this is not a common issue.Piperpiperaceous
Alternatively, try adding dont_filter=True to your Request objects (see the sketch after these comments). Usually duplicate requests are filtered out without prior notice. It might happen that your requests get redirected to an already visited URL and thus get filtered.Piperpiperaceous
I've tried to create a short demo script and it works without errors. So, as expected, the error is somewhere in the spider code. Probably I'm using yield incorrectly inside conditionals. I will update the question when I find the root cause.Seclusion
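For reference, a minimal sketch of the dont_filter suggestion applied to the request built in parse_page above (dont_filter is a standard Request argument; everything else is taken from the question's code):

# Bypass the duplicate filter for this particular request so it is never
# silently dropped as "already seen".
request = Request(target_link,
                  callback=self.parse_item_general,
                  dont_filter=True)
request.meta['date_updated'] = date_updated
yield request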

On the same note as Rho, you can add the setting

DUPEFILTER_CLASS = 'scrapy.dupefilter.BaseDupeFilter' 

to your "settings.py" which will remove the url caching. This is a tricky issue since there isn't a debug string in the scrapy logs that tells you when it uses a cached result.

Alcoholize answered 29/1, 2014 at 20:1 Comment(1)
I was having the same issue. Somehow, I was always losing 30 requests, and always the same requests. After setting this option in my settings.py file, everything worked just fine.Sunrise
