CrawlSpider with Splash getting stuck after first URL

I'm writing a Scrapy spider in which some of the responses need to be rendered with Splash. The spider is based on CrawlSpider, and the start_urls responses have to be rendered before they are fed to the crawl rules. Unfortunately, the spider stops after rendering the first response. Any idea what is going wrong?

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class VideoSpider(CrawlSpider):

    start_urls = ['https://juke.com/de/de/search?q=1+Mord+f%C3%BCr+2']

    rules = (
        Rule(LinkExtractor(allow=()), callback='parse_items', process_request='use_splash'),
    )

    def use_splash(self, request):
        # Ask Splash to render every request generated by the rule
        request.meta['splash'] = {
            'endpoint': 'render.html',
            'args': {
                'wait': 0.5,
            }
        }
        return request

    def start_requests(self):
        # Render the start URLs through Splash as well
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })

    def parse_items(self, response):
        data = response.body
        print(data)
Debroahdebs answered 22/6, 2016 at 21:15 Comment(1)
Does this answer your question? CrawlSpider with Splash (comment by Rayerayfield)

Use SplashRequest instead of scrapy.Request. See my answer to "CrawlSpider with Splash" for details.
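As a rough sketch of what that swap could look like: the Splash settings from the question can be kept in one place and passed either as keyword arguments to SplashRequest or in the meta['splash'] form. The helper names splash_options and splash_meta are hypothetical, and the commented usage assumes the scrapy-splash package is installed and configured.

```python
def splash_options(endpoint="render.html", wait=0.5):
    """Splash settings from the question, as keyword arguments."""
    return {"endpoint": endpoint, "args": {"wait": wait}}


def splash_meta(endpoint="render.html", wait=0.5):
    """The same settings in the meta['splash'] form used in the question."""
    return {"splash": splash_options(endpoint, wait)}


# Assumed usage inside the spider (requires scrapy-splash):
#   from scrapy_splash import SplashRequest
#   yield SplashRequest(url, self.parse, **splash_options())
```

Keeping the settings in one helper avoids the duplicated dict in use_splash and start_requests, so the two cannot drift apart.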

Prison answered 25/3, 2017 at 18:41 Comment(0)
Your current use_splash only modifies the request's meta:

def use_splash(self, request):
    request.meta['splash'] = {
        'endpoint': 'render.html',
        'args': {
            'wait': 0.5,
        }
    }
    return request

You should change it to:

def use_splash(self, request):
    return SplashRequest(xxxxxx)

Alternatively, you can override this CrawlSpider method:

    def _build_request(self, rule, link):
        r = Request(url=link.url, callback=self._response_downloaded)
        r.meta.update(rule=rule, link_text=link.text)
        return r
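
One thing to watch if the request is rebuilt rather than modified: _build_request above stores rule and link_text in meta, and as far as I can tell CrawlSpider relies on that rule entry to find the callback later, so a replacement request should probably carry the original meta forward. A minimal sketch of merging the Splash settings into an existing meta dict without losing those keys (add_splash is a hypothetical helper, and a plain dict stands in for request.meta):

```python
def add_splash(meta, endpoint="render.html", wait=0.5):
    """Return a copy of meta with a 'splash' entry added.

    Existing keys such as 'rule' and 'link_text' are preserved,
    and the dict passed in is left untouched.
    """
    merged = dict(meta)  # shallow copy so the original is not mutated
    merged["splash"] = {"endpoint": endpoint, "args": {"wait": wait}}
    return merged


# Plain dict standing in for the meta set by _build_request.
original = {"rule": 0, "link_text": "example"}
updated = add_splash(original)
```

The merged dict could then be passed as meta= to whichever request class is used.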

I can't guarantee this will work. I'm following this question, too.

Beestings answered 6/3, 2019 at 10:28 Comment(0)
