How to load local HTML file in Scrapy Splash?

I want to load a local HTML file using Scrapy Splash, save it as PNG/JPEG, and then delete the HTML file.

script = """
splash:go(args.url)
return splash:png()
"""
resp = requests.post('http://localhost:8050/run', json={
    'lua_source': script,
    'url': 'file://my_file.html'
})
resp.content

It returns

Failed loading page (Protocol "" is unknown) Network error #301

I have also tried

yield SplashRequest(url=filepath, 
                    callback=self.parse_result,
                    meta={'filepath': filepath},
                    args={
                        'wait': 0.5,
                        'png': 1,
                    },
                    endpoint='render.html',
                )

But I get

2020-04-23 12:07:41 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying http://localhost:8050/render.html> (failed 1 times): 502 Bad Gateway

Closegrained answered 23/4, 2020 at 12:9 Comment(0)

You’re using Scrapy Splash to talk to Scrapinghub’s Splash service, which generates the image. Splash only supports HTTP(S) requests, so a file:// URL will not load. To change that, you would have to clone their repository and implement the change yourself.
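
An alternative that avoids patching Splash, not mentioned in this answer, is to read the file yourself and hand its contents to Splash through the Lua method `splash:set_content`. The following is a minimal sketch under assumptions: Splash listens on `http://localhost:8050` as in the question, and `build_payload` and `render_file_to_png` are hypothetical helper names introduced only for this illustration. It uses the standard library instead of `requests` to stay dependency-free.

```python
import json
import urllib.request

# Lua body for Splash's /run endpoint: load the HTML passed in args.html
# instead of navigating to a URL, then take a screenshot.
LUA_SCRIPT = """
assert(splash:set_content(args.html, "text/html; charset=utf-8", args.baseurl))
assert(splash:wait(0.5))
return splash:png()
"""


def build_payload(html_path, baseurl=""):
    """Read the local HTML file and build the JSON payload for /run."""
    with open(html_path, encoding="utf-8") as f:
        html = f.read()
    return {"lua_source": LUA_SCRIPT, "html": html, "baseurl": baseurl}


def render_file_to_png(html_path, png_path,
                       splash_run_url="http://localhost:8050/run"):
    """Send the file contents to Splash and write the returned PNG bytes.

    Assumption: a Splash instance is reachable at splash_run_url.
    """
    body = json.dumps(build_payload(html_path)).encode("utf-8")
    request = urllib.request.Request(
        splash_run_url,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        png_bytes = response.read()
    with open(png_path, "wb") as out:
        out.write(png_bytes)
```

Calling `render_file_to_png("my_file.html", "my_file.png")` followed by `os.remove("my_file.html")` would cover the save-then-delete flow from the question. Relative links to images or stylesheets inside the HTML will only resolve if `baseurl` points somewhere Splash can reach.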

It might be easier, though, to serve the HTML through a web server running on localhost. However, if you’re running Splash in a Docker container, localhost inside the container refers to the container itself, so you’ll need to make the web server’s port reachable from the container.
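
The web-server approach could be sketched roughly as follows, using only Python's standard library. The names `serve_directory`, `render_png`, and `screenshot_local_file` are hypothetical helpers for this illustration, and `host_seen_by_splash` is an assumption you must supply: the hostname or IP under which the Splash process can reach your machine (for a Dockerized Splash this is typically not `localhost`).

```python
import functools
import http.server
import json
import threading
import urllib.request


def serve_directory(directory, host="0.0.0.0", port=0):
    """Serve `directory` over HTTP in a background thread.

    port=0 lets the OS pick a free port; read it from server.server_address.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer((host, port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def render_png(page_url, splash_url="http://localhost:8050/render.png"):
    """Ask Splash's render.png endpoint for a screenshot of page_url.

    Assumption: a Splash instance is reachable at splash_url.
    """
    body = json.dumps({"url": page_url, "wait": 0.5}).encode("utf-8")
    request = urllib.request.Request(
        splash_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


def screenshot_local_file(directory, filename, host_seen_by_splash,
                          splash_url="http://localhost:8050/render.png"):
    """Serve the file temporarily, let Splash render it, return PNG bytes."""
    server = serve_directory(directory)
    try:
        port = server.server_address[1]
        page_url = f"http://{host_seen_by_splash}:{port}/{filename}"
        return render_png(page_url, splash_url)
    finally:
        # Stop the temporary web server whether or not rendering succeeded.
        server.shutdown()
        server.server_close()
```

The server binds to `0.0.0.0` so that a container can connect to it; that also exposes the directory to the local network while it runs, so it is shut down as soon as the screenshot has been taken.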

Chat answered 24/6, 2020 at 14:43 Comment(1)
Actually, "then you could serve the HTML from a web server (as the code should be able to scrape localhost)." is wrong; even serving from localhost doesn't work. – Discus

The two links below recommend against using localhost. Some of the people there mentioned that turning off Crawlera fixed their problem. It could be trying to route your requests through external IPs to reach your localhost, which would be problematic.

Scrapy Splash on Ubuntu server: got an unexpected keyword argument 'encoding'

https://github.com/scrapy-plugins/scrapy-splash/issues/108

Timon answered 25/6, 2020 at 19:32 Comment(2)
By Crawlera, do you mean the proxy provider? – Discus
Why would I use a paid proxy provider while scraping local HTML? More importantly, if the Crawlera middleware is active, you can't open local HTML with it. Your "answer" is wrong on many levels. – Discus
