This question necessarily comes in two forms, because I don't know the better route to a solution.
A site I'm crawling often kicks me to a "User Blocked" redirect page, but the frequency (in requests per unit of time) seems random, and they appear to have a blacklist that blocks many of the "open" proxies I'm using through Proxymesh. So...
When Scrapy receives a "Redirect" in response to a request, e.g.
DEBUG: Redirecting (302) to (GET http://.../you_got_blocked.aspx) from (GET http://.../page-544.htm)
does it keep trying to get to page-544.htm, or will it move on to page-545.htm and forever lose out on page-544.htm? If it "forgets" the page (or counts it as visited), is there a way to tell it to keep retrying that page? (If it does that naturally, then yay, and good to know...)

What is the most efficient solution?
(a) What I'm currently doing: using a Proxymesh rotating proxy through the http_proxy environment variable, which appears to rotate proxies often enough to get past the target site's redirections at least fairly regularly. (Downsides: the open proxies are slow to ping, there are only so many of them, Proxymesh will eventually start charging me per gig past 10 gigs, I only need them to rotate when I'm redirected, I don't know how often or on what trigger they rotate, and, as above, I don't know whether the pages I'm being redirected from are being re-queued by Scrapy...) (If Proxymesh is rotating on each request, then I'm okay with paying reasonable costs.)
(b) Would it make sense (and be simple) to use middleware to select a new proxy on each redirection? What about on every single request? Would that make more sense through something else, like Tor or Proxifier? If this is relatively straightforward, how would I set it up? I've read about something like this in a few places, but most of what I found is outdated, with broken links or deprecated Scrapy commands.
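To make (b) concrete, this is roughly the kind of middleware I have in mind. It is only a sketch under several assumptions: the proxy URLs in `PROXIES` are placeholders, `RotateOnBlockMiddleware`, `BLOCK_MARKER`, `MAX_BLOCK_RETRIES` and the `block_retries` meta key are names I made up, and the block check simply looks for the `you_got_blocked` page from the log line above in the `Location` header. I have not confirmed that this is the recommended approach.

```python
import random

# Hypothetical proxy endpoints; placeholders, not real Proxymesh hosts.
PROXIES = [
    "http://proxy-a.example.com:8000",
    "http://proxy-b.example.com:8000",
    "http://proxy-c.example.com:8000",
]

BLOCK_MARKER = "you_got_blocked"  # taken from the redirect target in the log line
MAX_BLOCK_RETRIES = 5             # assumed cap so one page cannot loop forever


class RotateOnBlockMiddleware(object):
    """Downloader-middleware sketch: new proxy per request, re-queue on block."""

    def process_request(self, request, spider):
        # Pick a proxy for every outgoing request, including re-queued ones.
        request.meta["proxy"] = random.choice(PROXIES)
        return None  # let Scrapy carry on with the request

    def process_response(self, request, response, spider):
        # Scrapy returns header values as bytes on Python 3.
        location = response.headers.get("Location") or b""
        if isinstance(location, bytes):
            location = location.decode("latin-1")

        blocked = response.status in (301, 302) and BLOCK_MARKER in location
        if not blocked:
            return response

        retries = request.meta.get("block_retries", 0)
        if retries >= MAX_BLOCK_RETRIES:
            return response  # give up; the redirect is then handled as usual

        # Returning a Request reschedules it; dont_filter=True keeps the
        # duplicate filter from dropping the already-seen URL.
        retry = request.replace(dont_filter=True)
        retry.meta["block_retries"] = retries + 1
        return retry
```

As far as I understand, this would need a `DOWNLOADER_MIDDLEWARES` priority above 600 (the default for the built-in `RedirectMiddleware`), so that its `process_response` sees the 302 before the redirect is followed. Note that `random.choice` can pick the same proxy twice in a row, so a retry is not guaranteed to go out through a different one.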
For reference, I do currently have middleware set up for Proxymesh (yes, I'm already using the http_proxy environment variable, but I'm a fan of redundancy when it comes to not getting in trouble). This is what I have at the moment, in case it matters:
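import base64

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Route the request through the Proxymesh endpoint.
        request.meta['proxy'] = "http://open.proxymesh.com:[port number]"
        proxy_user_pass = "username:password"
        # b64encode adds no trailing newline (encodestring did) and works on Python 3.
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass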
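
For completeness, a middleware like this is enabled through the `DOWNLOADER_MIDDLEWARES` setting. The fragment below is only a sketch: `myproject.middlewares` is a placeholder module path, and 100 is an assumed priority, chosen to be lower than the built-in `HttpProxyMiddleware` default of 750 so that this `process_request` runs first.

```python
# settings.py (sketch; "myproject.middlewares" is a placeholder module path)
DOWNLOADER_MIDDLEWARES = {
    # Lower numbers sit closer to the engine, so their process_request runs earlier.
    "myproject.middlewares.ProxyMiddleware": 100,
}
```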