Scrapy, Splash and Connection was refused by other side: 10061
I am using Scrapy with Splash on a JavaScript-driven site. However, I can't get past a "Connection was refused by other side: 10061" error.

I get logs like this:

[scrapy.downloadermiddlewares.retry] DEBUG: Retrying 
 <GET https://www2.deloitte.com/ch/en/misc/search.html#country=All#qr=accounting     
 via http://localhost:8050/render.html> (failed 1 times): Connection 
 was refused by other side: 10061: No connection could be made because 
 the target machine actively refused it..

and a traceback pointing to twisted:

twisted.internet.error.ConnectionRefusedError: Connection was refused 
by other side: 10061: No connection could be made because the target 
machine actively refused it..

I have checked all the entries in settings and tried various USER_AGENT and robots.txt-related entries, but no luck. I also tried starting Splash with --disable-private-mode, but it had no effect.

Strangely, copy-pasting the same URL into the browser works perfectly.

I ran Scrapy both from the normal command line and via the API. Interestingly, when using the API and clicking the target URL in the error message within PyCharm, the hash sign # is replaced by its escape code. So I am unsure whether this is a separate issue under the hood or whether the two are related.

I even tried to look at the packets sent using both Wireshark and Fiddler, but I could not understand the results well enough, as I have never used these tools before.

Any suggestions would be greatly appreciated.

Perni answered 9/3, 2019 at 23:6 Comment(1)
Although I followed the install instructions for Splash to the letter, I am suspicious of the Docker installation. I'm not very familiar with Docker (I'm using the VM-based Toolbox version), but I have run some other test images (friendlyhello) in Docker, which worked fine. Might there be some Docker / VM configuration for Splash at play here that I'm missing? - Perni

I finally managed to identify the culprit. It was indeed the connection to the Docker container.

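First, I had to retrieve the IP of the Docker machine (the Toolbox VM that hosts the container) using

docker-machine ip

in the Docker terminal. Next, I had to adjust SPLASH_URL in the Scrapy settings.py file to point to the docker-machine IP instead of localhost (still on port 8050), and voilà, it works. With the VM-based Docker Toolbox, the ports published by the container are exposed on the VM's address rather than on the host's localhost, which is why the connection to localhost:8050 was actively refused.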
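For anyone who wants to check this before rerunning the spider, here is a minimal sketch of the setting plus a plain TCP reachability check. The address 192.168.99.100 is only a placeholder assumption (use whatever docker-machine ip prints on your machine), and splash_is_reachable is a hypothetical helper written for this answer, not part of Scrapy or scrapy-splash.

```python
import socket
from urllib.parse import urlsplit

# Assumption: placeholder address; replace with the output of `docker-machine ip`.
DOCKER_MACHINE_IP = "192.168.99.100"

# This is the value that goes into settings.py instead of http://localhost:8050.
SPLASH_URL = f"http://{DOCKER_MACHINE_IP}:8050"


def splash_is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the host and port in `url` succeeds.

    Hypothetical helper: it only tests that something accepts connections
    on that address, not that the listener is actually Splash.
    """
    parts = urlsplit(url)
    port = parts.port or 80
    try:
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError (error 10061 on Windows) and timeouts.
        return False
```

Calling splash_is_reachable("http://localhost:8050") and splash_is_reachable(SPLASH_URL) side by side should show which address actually accepts connections; in my setup only the docker-machine address did.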

Unfortunately, the sources I have seen so far have been rather unclear about this, so I hope this will be of some use for other poor souls setting splash up for the first time.

Perni answered 10/3, 2019 at 10:28 Comment(0)
