I am using Scrapy with Privoxy and Tor. Here is my previous question, "Scrapy with Privoxy and Tor: how to renew IP", and here is the spider:
from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request

class YourCrawler(CrawlSpider):
    name = "****"
    start_urls = [
        'https://****.com/listviews/titles.php',
    ]
    allowed_domains = ["****.com"]

    def parse(self, response):
        # go to the urls in the list
        s = Selector(response)
        page_list_urls = s.xpath('//*[@id="tab7"]/article/header/h2/a/@href').extract()
        for url in page_list_urls:
            yield Request(response.urljoin(url), callback=self.parse_following_urls, dont_filter=True)

        # Return back and go to the next page in div#paginat ul li.next a::attr(href) and begin again
        next_page = response.css('ul.pagin li.presente ~ li a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield Request(next_page, callback=self.parse)

    # For the urls in the list, go inside, and in div#main, take the div.ficha > div.caracteristicas > ul > li
    def parse_following_urls(self, response):
        # Parsing rules go here
        for each_book in response.css('main#main'):
            yield {
                'editor': each_book.css('header.datos1 > ul > li > h5 > a::text').extract(),
            }
In settings.py I have user agent rotation and Privoxy:
DOWNLOADER_MIDDLEWARES = {
    # user agent
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    '****.comm.rotate_useragent.RotateUserAgentMiddleware': 400,
    # privoxy
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    '****.middlewares.ProxyMiddleware': 100
}
In middlewares.py I added:
from stem import Signal
from stem.control import Controller

def _set_new_ip():
    with Controller.from_port(port=9051) as controller:
        controller.authenticate(password='tor_password')
        controller.signal(Signal.NEWNYM)

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        _set_new_ip()
        request.meta['proxy'] = 'http://127.0.0.1:8118'
        spider.log('Proxy : %s' % request.meta['proxy'])
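For debugging, here is a sketch of the same middleware with the stem call wrapped in a try/except, so that an unreachable control port gets logged instead of raising through Scrapy's middleware chain. The class name SafeProxyMiddleware is just for illustration; the ports and password are the same assumptions as in the code above:

# Debugging variant: same behaviour as ProxyMiddleware above, but a failed
# control-port connection is logged instead of aborting the request.
import stem
from stem import Signal
from stem.control import Controller

class SafeProxyMiddleware(object):
    def process_request(self, request, spider):
        try:
            with Controller.from_port(port=9051) as controller:
                controller.authenticate(password='tor_password')
                controller.signal(Signal.NEWNYM)
        except stem.SocketError as exc:
            # Nothing is listening on 9051, or the connection was refused
            spider.logger.error('Tor control port unreachable: %s' % exc)
        request.meta['proxy'] = 'http://127.0.0.1:8118'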
If I take out the _set_new_ip() function in middlewares.py (and the call to it in the ProxyMiddleware class), the spider works. But I want the spider to request a new IP each time, which is why I added it. The problem is that every time I try to run the spider it returns SocketError: [Errno 61] Connection refused, with this traceback:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python2.7/site-packages/scrapy/core/downloader/middleware.py", line 37, in process_request
    response = yield method(request=request, spider=spider)
  File "/Users/nikita/scrapy/***/***/middlewares.py", line 71, in process_request
    _set_new_ip()
  File "/Users/nikita/scrapy/***/***/middlewares.py", line 65, in _set_new_ip
    with Controller.from_port(port=9051) as controller:
  File "/usr/local/lib/python2.7/site-packages/stem/control.py", line 998, in from_port
    control_port = stem.socket.ControlPort(address, port)
  File "/usr/local/lib/python2.7/site-packages/stem/socket.py", line 372, in __init__
    self.connect()
  File "/usr/local/lib/python2.7/site-packages/stem/socket.py", line 243, in connect
    self._socket = self._make_socket()
  File "/usr/local/lib/python2.7/site-packages/stem/socket.py", line 401, in _make_socket
    raise stem.SocketError(exc)
SocketError: [Errno 61] Connection refused
2017-07-11 15:50:28 [scrapy.core.engine] INFO: Closing spider (finished)
Maybe the problem is the port used in with Controller.from_port(port=9051) as controller:, but I am not sure. If anybody has an idea, that would be fantastic…
EDIT---
Ok, if I open http://127.0.0.1:8118/ in the browser, it says:
503
This is Privoxy 3.0.26 on localhost (127.0.0.1), port 8118, enabled
Forwarding failure
Privoxy was unable to socks5-forward your request http://127.0.0.1:8118/ through localhost: SOCKS5 request failed
Just try again to see if this is a temporary problem, or check your forwarding settings and make sure that all forwarding servers are working correctly and listening where they are supposed to be listening.
So maybe it is related to the SOCKS5 forwarding configuration… Does anyone know?
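As a separate check of the Privoxy-to-Tor chain, independent of Scrapy, the sketch below sends one request through Privoxy. It assumes the requests library is installed, Privoxy is listening on 127.0.0.1:8118, and Privoxy forwards to Tor's SOCKS port 9050 (the usual forward-socks5t line in its config):

# End-to-end check of the Privoxy -> Tor chain, independent of Scrapy.
import requests

proxies = {
    'http': 'http://127.0.0.1:8118',
    'https': 'http://127.0.0.1:8118',
}

r = requests.get('https://check.torproject.org/', proxies=proxies, timeout=30)
print(r.status_code)
print('Congratulations' in r.text)  # True if the exit is recognised as Tor

If this returns the same 503 forwarding failure, the problem sits between Privoxy and Tor's SOCKS listener (127.0.0.1:9050) rather than in Scrapy or stem.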
Comments:

"… stem …" – Heartburning

"… authenticate() function. In the example they give, first they make a control_socket = stem.socket.ControlPort(port = 9051), and after that stem.connection.authenticate(control_socket). Should I put both of them in the ProxyMiddleware class?" – Planospore

"… connect() function somewhere, but where? I tried some options but none were successful…" – Planospore