Change IP address dynamically? [closed]
Consider this case: I want to crawl websites frequently, but my IP address gets blocked after some days or after hitting a request limit.

So how can I change my IP address dynamically, or are there any other ideas?

Corney answered 4/3, 2015 at 10:27 Comment(0)

An approach using Scrapy makes use of two components: RandomProxy and RotateUserAgentMiddleware.

Insert the new components into DOWNLOADER_MIDDLEWARES in settings.py as follows:

DOWNLOADER_MIDDLEWARES = {
    # Note: the scrapy.contrib.* paths below are for older Scrapy releases;
    # in Scrapy 1.0+ these middlewares live under scrapy.downloadermiddlewares.* instead.
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 90,
    'tutorial.randomproxy.RandomProxy': 100,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'tutorial.spiders.rotate_useragent.RotateUserAgentMiddleware': 400,
}

Random Proxy

You can use scrapy-proxies. This component processes Scrapy requests through a random proxy taken from a list, to avoid IP bans and improve crawling speed.

You can build up your proxy list from a quick internet search; copy the links into the list.txt file in the URL format the component expects.
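For reference, scrapy-proxies is configured through a few extra settings in settings.py. As a sketch (the file path is a placeholder, and the setting names are taken from the scrapy-proxies README, so double-check them against the version you install):

```python
# Retry failed pages many times, since free proxies fail often.
RETRY_TIMES = 10
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

# Path to your proxy list file (placeholder path).
PROXY_LIST = '/path/to/proxy/list.txt'

# 0 = pick a different random proxy for every request.
PROXY_MODE = 0
```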

Rotation of user agent

For each scrapy request a random user agent will be used from a list you define in advance:

import random

from scrapy import log
from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware


class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)

            # Add desired logging message here.
            spider.log(
                u'User-Agent: {} {}'.format(request.headers.get('User-Agent'), request),
                level=log.DEBUG
            )

    # The default user_agent_list covers Chrome, IE, Firefox, Mozilla, Opera and Netscape.
    # For more user agent strings, see http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]


Chivers answered 5/3, 2015 at 11:38 Comment(4)
Thanks for the solution, it helps me a lot. Can you please elaborate on the role of DOWNLOADER_MIDDLEWARES here? @Chivers – Sharyl
In RotateUserAgentMiddleware, only one user agent is assigned per spider; I am crawling with recursive calls, and the User-Agent is not changing on every call with the above code. Please help me out. – Sharyl
How do I get the proxy list? Anybody? – Ommiad
Have you tried searching Google for "free proxy list"? – Chivers
If you are using R, you could do the web crawling through Tor. Tor resets its IP address roughly every 10 minutes automatically. There may be a way to force Tor to change the IP at shorter intervals, but that didn't work for me. Instead, you can set up multiple instances of Tor and switch between the independent instances (here you can find a good explanation of how to set up multiple instances of Tor: https://tor.stackexchange.com/questions/2006/how-to-run-multiple-tor-browsers-with-different-ips).

After that you can do something like the following in R (use the SOCKS ports of your independent Tor instances and a list of user agents; every time you call the getURL function, cycle through your list of ports/user agents):

library(RCurl)

port <- c(9050, 9052, 9054)   # placeholder: the SOCKS ports of your Tor instances
proxy <- paste("socks5h://127.0.0.1:", port, sep = "")
ua <- c("Mozilla/5.0 ...", "Mozilla/5.0 ...")  # placeholder: your list of user agents

opt <- list(proxy = sample(proxy, 1),
            useragent = sample(ua, 1),
            followlocation = TRUE,
            referer = "",
            timeout = 30,          # seconds
            verbose = FALSE,
            ssl.verifypeer = TRUE)

webpage <- getURL(url = url, .opts = opt)
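The same rotation idea carries over to other languages. As a rough Python sketch (the port numbers and user-agent strings are made-up placeholders, and no requests are actually sent), cycling through a pool of SOCKS proxies and user agents per request might look like:

```python
import itertools

# Hypothetical pool of Tor SOCKS ports and user-agent strings (placeholders).
ports = [9050, 9052, 9054]
user_agents = ["UA-1", "UA-2"]

proxies = itertools.cycle("socks5h://127.0.0.1:%d" % p for p in ports)
uas = itertools.cycle(user_agents)

def next_request_options():
    """Return the proxy/user-agent pair to use for the next request."""
    return {"proxy": next(proxies), "user_agent": next(uas)}

# Each call advances through both pools, so consecutive requests differ.
opts = [next_request_options() for _ in range(4)]
```

You would then pass each pair to whatever HTTP client you use; the point is only that the rotation state lives outside the individual request call.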
Raucous answered 18/7, 2016 at 17:29 Comment(0)
Some VPN applications allow you to automatically change your IP address to a new random one at a set interval, such as every 2 minutes. Both HMA! Pro VPN and VPN4ALL support this feature.

Quieten answered 5/3, 2015 at 14:36 Comment(0)
A word of warning about VPNs: check their terms and conditions carefully, because scraping through them often goes against the provider's user policy (one such example is Astrill). I tried a scraping tool and got my account locked.

Anthropogeography answered 29/7, 2018 at 21:14 Comment(0)
If you have multiple public IPs, add them to your network interface; on Linux, you can then use iptables to switch between those public IPs.

Sample iptables rules for two IPs:

iptables -t nat -A POSTROUTING -m statistic --mode random --probability 0.5 -j SNAT --to-source 192.168.0.2

iptables -t nat -A POSTROUTING -j SNAT --to-source 192.168.0.3

Note that nat rules are evaluated in order: the first rule SNATs half of the connections, and the second rule should catch everything that is left (with --probability 0.5 on both rules, a quarter of the traffic would match neither). For n IPs, give rule i (counting from 1) a probability of 1/(n - i + 1), so with 4 IPs the first rule uses 0.25 and the last rule matches unconditionally.
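As a quick sketch (the addresses are made-up placeholders), the rules for an even split across n source addresses can be generated like this; under sequential rule evaluation, rule i needs a probability of 1 over the number of rules remaining:

```python
def snat_rules(ips):
    """Build iptables SNAT rules that split traffic evenly across ips.

    Rule i gets probability 1/(rules remaining), so each address ends up
    with 1/len(ips) of the traffic when the rules are evaluated in order.
    """
    rules = []
    n = len(ips)
    for i, ip in enumerate(ips):
        prob = 1.0 / (n - i)
        if prob >= 1.0:
            match = ""  # last rule matches everything that is left
        else:
            match = "-m statistic --mode random --probability %.4f " % prob
        rules.append(
            "iptables -t nat -A POSTROUTING %s-j SNAT --to-source %s" % (match, ip)
        )
    return rules

# Example with placeholder addresses: probabilities 0.25, 0.3333, 0.5, then a catch-all.
for rule in snat_rules(["192.168.0.2", "192.168.0.3", "192.168.0.4", "192.168.0.5"]):
    print(rule)
```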

You can also create your own proxy server with a few simple steps; these rules will let it switch its outgoing IPs.

Special answered 31/7, 2018 at 12:36 Comment(0)
