I have the following code that is partially working:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashRequest


class ThreadSpider(CrawlSpider):
    name = 'thread'
    allowed_domains = ['bbs.example.com']
    start_urls = ['http://bbs.example.com/diy']

    rules = (
        # follow 'Next Page' links and (intended) render them with Splash
        Rule(LinkExtractor(
                 allow=(),
                 restrict_xpaths=("//a[contains(text(), 'Next Page')]")),
             callback='parse_item',
             process_request='start_requests',
             follow=True),
    )

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse_item, args={'wait': 0.5})

    def parse_item(self, response):
        # item parser
        pass
The code will run only for the start_urls, but will not follow the links specified in restrict_xpaths. If I comment out the start_requests() method and the line process_request='start_requests' in the rules, it will run and follow the links as intended, though of course without JS rendering.
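For reference, here is a minimal sketch of that working (but non-rendering) variant; the only changes are removing start_requests() and the process_request line:

# (imports as above)
class ThreadSpider(CrawlSpider):
    name = 'thread'
    allowed_domains = ['bbs.example.com']
    start_urls = ['http://bbs.example.com/diy']

    rules = (
        Rule(LinkExtractor(
                 allow=(),
                 restrict_xpaths=("//a[contains(text(), 'Next Page')]")),
             callback='parse_item',
             follow=True),  # links are followed, but pages are not JS-rendered
    )

    def parse_item(self, response):
        # item parser
        pass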
I have read the two related questions, CrawlSpider with Splash getting stuck after first URL and CrawlSpider with Splash, and specifically changed scrapy.Request() to SplashRequest() in the start_requests() method, but that does not seem to work. What is wrong with my code?
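For what it's worth, my understanding of the Rule API is that process_request expects a callable (or the name of a spider method) that receives each extracted request and returns a replacement request or None, whereas start_requests is a generator that takes no request at all. So presumably the hook would have to look more like this hypothetical use_splash method (use_splash is my own name, and this is only an untested sketch of my understanding):

    # referenced from the rule as process_request='use_splash'
    def use_splash(self, request, response=None):
        # 'response' is only passed by newer Scrapy versions; accepting it
        # with a default keeps the signature valid either way.
        # Keep the callback and meta the rule installed so the CrawlSpider
        # follow/callback machinery still works.
        return SplashRequest(request.url, callback=request.callback,
                             meta=request.meta, args={'wait': 0.5})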
Thanks,
restrict_xpaths=("//a[contains(text(), 'Next Page')]") works just fine if I comment out the start_requests() method. Anyway, I realize this is an unsolved problem, as reported by many users here: github.com/scrapy-plugins/scrapy-splash/issues/92 – Lubricator
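As far as I can tell, the root cause tracked in that issue is that CrawlSpider._requests_to_follow starts with an isinstance(response, HtmlResponse) check, and Splash responses are SplashTextResponse/SplashJsonResponse, which do not subclass HtmlResponse, so rule-based link extraction is skipped entirely on rendered pages. One direction that gets suggested is widening that check; below is an untested sketch that re-wraps the Splash response as an HtmlResponse before delegating to the stock implementation. SplashCrawlSpider and the utf-8 assumption are mine:

from scrapy.http import HtmlResponse
from scrapy.spiders import CrawlSpider
from scrapy_splash import SplashTextResponse


class SplashCrawlSpider(CrawlSpider):
    def _requests_to_follow(self, response):
        # CrawlSpider ignores anything that is not an HtmlResponse, so
        # re-wrap Splash responses before handing them to the original
        # implementation (assumes the default render.html endpoint, i.e.
        # a utf-8 encoded HTML body).
        if isinstance(response, SplashTextResponse):
            response = HtmlResponse(response.url, body=response.body,
                                    encoding='utf-8',
                                    request=response.request)
        return super()._requests_to_follow(response)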