Scrapy + Splash = Connection Refused
I installed Splash using this link and followed all the installation steps, but Splash doesn't work.

My settings.py file:

BOT_NAME = 'Teste'
SPIDER_MODULES = ['Test.spiders']
NEWSPIDER_MODULE = 'Test.spiders'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
SPLASH_URL = 'http://127.0.0.1:8050/'

When I run scrapy crawl TestSpider:

[scrapy.core.engine] INFO: Spider opened
[scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
[scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.google.com.br via http://127.0.0.1:8050/render.html> (failed 1 times): Connection was refused by other side: 111: Connection refused.
[scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.google.com.br via http://127.0.0.1:8050/render.html> (failed 2 times): Connection was refused by other side: 111: Connection refused.
[scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.google.com.br via http://127.0.0.1:8050/render.html> (failed 3 times): Connection was refused by other side: 111: Connection refused.
[scrapy.core.scraper] ERROR: Error downloading <GET http://www.google.com.br via http://127.0.0.1:8050/render.html>
Traceback (most recent call last):
  File "/home/ricardo/scrapy/lib/python3.5/site-packages/twisted/internet/defer.py", line 1126, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/home/ricardo/scrapy/lib/python3.5/site-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/home/ricardo/scrapy/lib/python3.5/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request, spider=spider)))
twisted.internet.error.ConnectionRefusedError: Connection was refused by other side: 111: Connection refused.
[scrapy.core.engine] INFO: Closing spider (finished)
[scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 3,
'downloader/request_bytes': 1476,
'downloader/request_count': 3,
'downloader/request_method_count/POST': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 6, 29, 21, 36, 16, 72916),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'memusage/max': 47468544,
'memusage/startup': 47468544,
'retry/count': 2,
'retry/max_reached': 1,
'retry/reason_count/twisted.internet.error.ConnectionRefusedError': 2,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'splash/render.html/request_count': 1,
'start_time': datetime.datetime(2017, 6, 29, 21, 36, 15, 851593)}
[scrapy.core.engine] INFO: Spider closed (finished)

Here is my spider:

import scrapy
from scrapy_splash import SplashRequest

class TesteSpider(scrapy.Spider):
    name = "Teste"

    def start_requests(self):
        yield SplashRequest("http://www.google.com.br", self.parse,
                            meta={"splash": {"endpoint": "render.html"}})

    def parse(self, response):
        self.log('Hello World')

I tried to run this in the terminal:

curl "http://localhost:8050/render.html?url=http://www.google.com/"

Output:

curl: (7) Failed to connect to localhost port 8050: Connection refused
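The curl failure confirms the same thing the spider's traceback does: nothing is listening on port 8050, so the problem is the Splash server, not the Scrapy configuration. The check can be reproduced with the standard library alone (the helper name `port_open` is hypothetical, chosen for this sketch; no Splash or Scrapy install is needed):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds, False otherwise."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError)
        # when nothing is listening on the target port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Same diagnosis as the curl command: is anything on 8050?
    print("Splash reachable:", port_open("127.0.0.1", 8050))
```

If this prints `False`, Splash simply is not running, and no middleware or settings change will help until it is started.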

Sforza answered 29/6, 2017 at 22:17
Are you using Docker? What command are you using to run Splash? What is your OS? What is your Docker version? If you can't access Splash on localhost:8050, then Docker is likely using a different host, or maybe you forgot to expose port 8050. — Microparasite
I'm not using Docker, but a venv on Ubuntu 16.04. Is it necessary to use Docker? — Sforza
It is not necessary to use Docker, but it is the easiest way to install Splash. You can install it into a virtualenv, but that is harder. How are you starting Splash — could you paste the exact command? Are you sure Splash is running? — Microparasite
Thank you @MikhailKorobov! It is much easier to use Docker. — Sforza
You need to run Splash via the command line:

sudo docker run -p 8050:8050 scrapinghub/splash

And in settings.py set:

SPLASH_URL = 'http://localhost:8050'
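Once the container is up, it is worth confirming that Splash actually responds before launching the crawl. Splash exposes a /_ping health endpoint that returns HTTP 200 when the server is ready; a minimal stdlib check (the helper name `splash_alive` is hypothetical) might look like:

```python
from urllib.request import urlopen
from urllib.error import URLError

def splash_alive(base_url: str = "http://localhost:8050", timeout: float = 3.0) -> bool:
    """Return True if the Splash /_ping endpoint answers with HTTP 200."""
    try:
        with urlopen(base_url.rstrip("/") + "/_ping", timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False
```

Calling `splash_alive()` right before `scrapy crawl` turns the cryptic "Connection refused" retries into an immediate, obvious failure.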
Pock answered 24/6, 2018 at 2:56
Please make sure your Splash server is up and running before starting the spider:

sudo docker run -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash

Bernardinebernardo answered 5/2, 2018 at 11:14

© 2022 - 2024 — McMap. All rights reserved.