How to run splash using docker toolbox

I am trying out Scrapy with Splash to scrape dynamic content off the web, and I'm on Windows 10 Home Edition. Is there a way to use Docker Toolbox instead of Docker Desktop so I can work with Splash?

The Docker Toolbox documentation says it is an alternative for systems that cannot run Docker Desktop. Docker Desktop seems essential for running Splash, but it requires Windows 10 Pro or an Enterprise edition.

I could not find a way to configure Docker Toolbox for Splash. Are there any guidelines for setting up Splash with Docker Toolbox on Windows 10 Home Edition? Thanks!

Unattended answered 15/4, 2019 at 23:59

It will work fine with Docker Toolbox too. Just follow the same process, and make sure you point Scrapy at the Docker machine's IP, which you can get with:

docker-machine ip default

If you don't know the process, here is one way to use scrapy-splash:

Run Splash in Docker

# Install Docker: http://docker.io/
# Pull the image:
    $ docker pull scrapinghub/splash
# Start the container:
    $ docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash
# Splash is now available on the Docker machine IP (typically 192.168.99.100
# with Docker Toolbox) at ports 8050 (HTTP) and 5023 (telnet).
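To check that the container is reachable from Windows before wiring up Scrapy, you can call Splash's HTTP API directly. A quick sanity check, assuming the Docker Toolbox VM is on its usual default IP (replace it with whatever docker-machine ip default reports):

# Fetch a page through Splash's render.html endpoint; a successful
# response returns the rendered HTML of the page
    $ curl "http://192.168.99.100:8050/render.html?url=https://example.com&wait=0.5"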

Add the following to your Scrapy settings (settings.py)

# Splash endpoint for scripting or JS-dependent web pages
# Get the Docker machine IP with: docker-machine ip default

SPLASH_URL = 'http://<docker-machine-ip>:8050'  # docker url
# SPLASH_URL = 'http://192.168.99.100:8050'     # docker url

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
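If you also use Scrapy's built-in HTTP cache, the scrapy-splash docs recommend a Splash-aware cache storage as well (optional; only relevant when the HTTP cache is enabled):

# Optional: only needed if Scrapy's HTTP cache is turned on
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'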

When parsing, add the Splash meta to the request before yielding it:

response.meta['splash'] = {'args': {'html': 1, 'png': 1}, 'endpoint': 'render.json'}
yield scrapy.Request(response.url, callback=self.parse_page, meta=response.meta)
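Alternatively, scrapy-splash provides a SplashRequest helper that sets this meta for you. A minimal spider sketch (the spider name, start URL, and selector below are placeholders for illustration):

import scrapy
from scrapy_splash import SplashRequest

class QuotesSpider(scrapy.Spider):
    # Example spider; name and start URL are placeholders
    name = 'quotes_js'

    def start_requests(self):
        # SplashRequest routes the request through the Splash instance
        # configured in SPLASH_URL
        yield SplashRequest(
            'http://quotes.toscrape.com/js/',
            callback=self.parse_page,
            args={'wait': 0.5},
        )

    def parse_page(self, response):
        # response contains the HTML as rendered by Splash
        for text in response.css('div.quote span.text::text').getall():
            yield {'text': text}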

For more details, check these documents:
javascript-in-scrapy-with-splash
splash-through-http-api

Kevyn answered 16/4, 2019 at 5:12
