Scrapy Shell and Scrapy Splash

Asked 11/2, 2016 at 23:56 Answered 7/7, 2019 at 5:40

Solved web-scraping scrapy scrapy-splash scrapy-shell splash-js-render

We've been using the scrapy-splash middleware to pass the scraped HTML source through the Splash JavaScript engine running inside a Docker container.

If we want to use Splash in the spider, we configure several required project settings and yield a Request with Splash-specific meta arguments:

yield Request(url, self.parse_result, meta={
    'splash': {
        'args': {
            # set rendering arguments here
            'html': 1,
            'png': 1,

            # 'url' is prefilled from request url
        },

        # optional parameters
        'endpoint': 'render.json',  # optional; default is render.json
        'splash_url': '<url>',      # overrides SPLASH_URL
        'slot_policy': scrapy_splash.SlotPolicy.PER_DOMAIN,
    }
})

This works as documented. But how can we use scrapy-splash inside the Scrapy Shell?

Occasion answered 11/2, 2016 at 23:56 Comment(1)

It's true that there's no DEFAULT_REQUEST_META the way there is a DEFAULT_REQUEST_HEADERS, which would be a nice addition. There are open discussions on enabling Splash by default via a middleware (see github.com/scrapinghub/scrapy-splash/issues/11). Another option is to subclass the scrapy-splash middleware and force the settings there. Ideas welcome on github.com/scrapinghub/scrapy-splash/issues – Dorotheadorothee 12/2, 2016 at 12:57

Just wrap the URL you want to open in the shell in a call to the Splash HTTP API.

So you would want something like:

scrapy shell 'http://localhost:8050/render.html?url=http://example.com/page-with-javascript.html&timeout=10&wait=0.5'

where:

localhost:port is where your Splash service is running.
url is the URL you want to crawl; don't forget to URL-quote it!
render.html is one of the possible HTTP API endpoints; in this case it returns the rendered HTML page.
timeout is the request timeout, in seconds.
wait is the time, in seconds, to wait for JavaScript to execute before reading/saving the HTML.
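Since the target URL has to be URL-quoted, it can be easier to build the shell URL programmatically. Below is a minimal sketch using only the Python standard library; the helper name splash_render_url, the default base http://localhost:8050, and the query string on the example page are assumptions made for illustration, not part of scrapy-splash.

```python
from urllib.parse import urlencode


def splash_render_url(target_url, splash_base="http://localhost:8050",
                      endpoint="render.html", **args):
    """Build a Splash HTTP API URL that renders target_url.

    Hypothetical helper: urlencode() percent-encodes the target URL,
    so its own '?' and '&' are not mistaken for Splash arguments.
    """
    query = urlencode({"url": target_url, **args})
    return f"{splash_base}/{endpoint}?{query}"


# Example target with its own query string (made up for illustration).
shell_url = splash_render_url(
    "http://example.com/page-with-javascript.html?a=1&b=2",
    timeout=10,
    wait=0.5,
)
print(shell_url)
```

The printed URL can then be passed, in quotes, to scrapy shell.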

Aziza answered 12/2, 2016 at 9:54 Comment(3)

you can probably make a bash alias to make this more convenient. – Aziza 12/2, 2016 at 10:1

@StephenOstermiller you just uppercased some words and ruined the formatting. – Aziza 11/7, 2022 at 10:53

Something is funky with the markdown formatting; I've never seen trailing white space introduce new lines in the output. Using list formatting will preserve the new lines. I also used example.com instead of a non-example .com, which is the main reason for the edit. – Pincenez 11/7, 2022 at 10:58

You can run scrapy shell without arguments inside a configured Scrapy project, then create req = scrapy_splash.SplashRequest(url, ...) and call fetch(req).

Wilkerson answered 20/4, 2016 at 13:42 Comment(0)

For Windows users running Docker Toolbox:

Replace the single quotes with double quotes to prevent the invalid hostname:http error.
Change localhost to the Docker IP address, which is shown below the whale logo. For me it was 192.168.99.100.

Finally I got this:

scrapy shell "http://192.168.99.100:8050/render.html?url=https://example.com/category/banking-insurance-financial-services/"

Roorback answered 7/7, 2019 at 5:40 Comment(0)
