We've been using the scrapy-splash
middleware to pass scraped pages through the Splash
JavaScript rendering engine running inside a Docker container.
To use Splash in a spider, we configure several required project settings (sketched after the request example below) and yield a Request
carrying specific meta
arguments:
from scrapy import Request
import scrapy_splash

# inside the spider's parse method:
yield Request(url, self.parse_result, meta={
    'splash': {
        'args': {
            # set rendering arguments here
            'html': 1,
            'png': 1,
            # 'url' is prefilled from request url
        },
        # optional parameters
        'endpoint': 'render.json',  # optional; default is render.json
        'splash_url': '<url>',      # optional; overrides SPLASH_URL
        'slot_policy': scrapy_splash.SlotPolicy.PER_DOMAIN,
    }
})
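For reference, the required project settings are the ones listed in the scrapy-splash README; only SPLASH_URL varies with the setup (the value below assumes the Docker container is exposed on Splash's default port 8050):

# settings.py
SPLASH_URL = 'http://localhost:8050'  # depends on your Docker setup

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCache'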
This works as documented. But how can we use scrapy-splash
inside the Scrapy shell?
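To make the question concrete, this is roughly what we would like to run from a shell started inside the project (example.com is just a placeholder); whether the splash meta is actually honoured there is what is being asked:

# inside `scrapy shell`, launched from the project directory
from scrapy import Request

req = Request('http://example.com', meta={
    'splash': {'args': {'html': 1, 'png': 1}},
})
fetch(req)  # the shell's fetch() helper accepts a Request object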
There is no DEFAULT_REQUEST_META setting,
unlike DEFAULT_REQUEST_HEADERS, and it would be a nice addition. There are open discussions on enabling Splash by default via a middleware (see github.com/scrapinghub/scrapy-splash/issues/11). Another option is to subclass the scrapy-splash middleware and force the settings there. Ideas welcome on github.com/scrapinghub/scrapy-splash/issues – Dorothee
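A minimal sketch of that subclassing idea, assuming scrapy_splash.SplashMiddleware can be extended like any other downloader middleware; the class name DefaultSplashMiddleware and the args it injects are made up for illustration:

from scrapy_splash import SplashMiddleware

class DefaultSplashMiddleware(SplashMiddleware):
    # Hypothetical subclass: add splash meta to any request that does not
    # already carry it, so every fetch goes through Splash by default.
    def process_request(self, request, spider):
        request.meta.setdefault('splash', {'args': {'html': 1}})
        return super().process_request(request, spider)

It would be registered in DOWNLOADER_MIDDLEWARES in place of the stock 'scrapy_splash.SplashMiddleware' entry, so a shell started inside the project (which loads the project settings) would pick it up as well.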