How to rotate proxies with Python requests

I'm trying to do some scraping, but I get blocked every 4 requests. I have tried changing proxies, but the error is the same. What should I do to rotate them properly?

Here is some code where I try it. First I get proxies from a free proxy-list site. Then I make the request with the new proxy, but it doesn't work because I still get blocked.

from fake_useragent import UserAgent
import requests

def get_player(id,proxy):
    ua=UserAgent()
    headers = {'User-Agent':ua.random}

    url='https://www.transfermarkt.es/jadon-sancho/profil/spieler/'+str(id)

    try:
        print(proxy)
        r=requests.get(url,headers=headers,proxies=proxy)
    except requests.exceptions.RequestException:

....
code to manage the data
....

Getting proxies

from bs4 import BeautifulSoup

def get_proxies():
    ua=UserAgent()
    headers = {'User-Agent':ua.random}
    url='https://free-proxy-list.net/'

    r=requests.get(url,headers=headers)
    page = BeautifulSoup(r.text, 'html.parser')

    proxies=[]

    for proxy in page.find_all('tr'):
        i=ip=port=0

        for data in proxy.find_all('td'):
            if i==0:
                ip=data.get_text()
            if i==1:
                port=data.get_text()
            i+=1

        if ip!=0 and port!=0:
            proxies+=[{'http':'http://'+ip+':'+port}]

    return proxies

Calling functions

proxies=get_proxies()
for i in range(1,100):
    player=get_player(i,proxies[i//4])

....
code to manage the data  
....

I know that the proxy scraping works, because when I print them I see something like {'http': 'http://88.12.48.61:42365'}. I just don't want to get blocked.

Shaefer answered 26/4, 2019 at 17:7 Comment(8)
I had this problem in the past. Do you know if those proxies are HTTP or HTTPS proxies, and whether the server only accepts a specific type? I had the same issue until I learned the server only accepts HTTP proxies but I was feeding it HTTPS proxies. Now my script just runs 24/7. – Pied
It could be possible. I have just tried with HTTPS and it is even worse, because I can't access the site at all. With HTTP I get a maximum of 6 requests, but with HTTPS not even one. – Verbalism
Quick question: what are you trying to scrape that you're getting blocked? – Aye
It's 'transfermarkt', a football website. In the end I tried HTTPS proxies from 'hidemyna.me/es/proxy-list/?type=s#list' and it worked. Do you know another free page to get a list? – Verbalism
@JavierJiménezdelaJara does using a VPN help? Have you tried scrapy? It might work. – Aye
I used proxybroker (a GitHub package) to get proxies and it worked perfectly. – Verbalism
Great, but I'm still wondering why the website was blocking your requests after 5 requests. – Aye
Hey @JavierJiménezdelaJara, I'm pretty sure I'm now doing what you tried here, but for 'ogol', another football website. Do you have any contact to share, like your Discord or Telegram, so I could get some tips from you and your code? Thanks! – Waynewayolle

I recently had this same issue, but using online proxy servers as recommended in other answers is always risky (from a privacy standpoint), slow, or unreliable.

Instead, you can use my requests-ip-rotator Python library to proxy traffic through AWS API Gateway, which gives you a new IP each time: pip install requests-ip-rotator

This can be used as follows (for your site specifically):

import requests
from requests_ip_rotator import ApiGateway, EXTRA_REGIONS

gateway = ApiGateway("https://www.transfermarkt.es")
gateway.start()

session = requests.Session()
session.mount("https://www.transfermarkt.es", gateway)

response = session.get("https://www.transfermarkt.es/jadon-sancho/profil/spieler/your_id")
print(response.status_code)

# Only run this line if you are no longer going to run the script, as it takes longer to boot up again next time.
gateway.shutdown() 

Combined with multithreading/multiprocessing, you'll be able to scrape the site in no time.
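
For example, a minimal sketch of what that could look like with concurrent.futures; the worker count, the ID range, and the fetch_player helper are illustrative assumptions, not part of the library's API:

from concurrent.futures import ThreadPoolExecutor

import requests
from requests_ip_rotator import ApiGateway

# Same setup as above
gateway = ApiGateway("https://www.transfermarkt.es")
gateway.start()

session = requests.Session()
session.mount("https://www.transfermarkt.es", gateway)

def fetch_player(player_id):
    # Each request goes out through the API Gateway, which rotates the source IP
    url = f"https://www.transfermarkt.es/jadon-sancho/profil/spieler/{player_id}"
    return player_id, session.get(url).status_code

# 8 workers and IDs 1-99 are arbitrary values for this sketch
with ThreadPoolExecutor(max_workers=8) as pool:
    for player_id, status in pool.map(fetch_player, range(1, 100)):
        print(player_id, status)

gateway.shutdown()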

The AWS free tier provides you with 1 million requests per region, so this option will be free for all reasonable scraping.

Dateline answered 20/7, 2021 at 8:33 Comment(8)
Amazing tool, thanks for putting it together! – Park
Thanks! I would also like to add that you need to get your API keys from AWS and pass them in this way: gateway = ApiGateway(site="site.com", access_key_id=AWS_ACCESS_KEY_ID, access_key_secret=AWS_SECRET_ACCESS_KEY). You can follow this guide on how to retrieve your keys. – Contemporize
Indeed. Optionally, if the keys are stored in environment variables, they will be picked up automatically, as detailed in this AWS guide :) – Dateline
FYI, datacenter IP addresses are almost always blocked these days, so you won't have much luck scraping with this approach. – Outbid
Hi @Granitosaurus, I use this method regularly for my company's web scraping and we get a successful scrape on around 80% of otherwise blocked sites. I think it depends largely on the type of site. Hope this helps. – Dateline
@Dateline It doesn't work for me; you must have an access_key_id. – Pyosis
@Pyosis as per the package github.com/Ge0rg3/requests-ip-rotator, you need to either have the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables set, or set them directly in the ApiGateway() constructor. Hope this helps. – Dateline
With this setup I received Invalid endpoint: https://apigateway..amazonaws.com. Any idea how to configure it? – Carma

import requests
from itertools import cycle

list_proxy = ['socks5://Username:Password@IP1:20000',
              'socks5://Username:Password@IP2:20000',
              'socks5://Username:Password@IP3:20000',
              'socks5://Username:Password@IP4:20000',
              ]

proxy_cycle = cycle(list_proxy)
# Prime the pump
proxy = next(proxy_cycle)

for i in range(1, 10):
    proxy = next(proxy_cycle)
    print(proxy)
    proxies = {
      "http": proxy,
      "https":proxy
    }
    r = requests.get(url='https://ident.me/', proxies=proxies)
    print(r.text)
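
A hedged sketch of how the same itertools.cycle rotation could be combined with simple retry logic, so that a failing proxy is skipped and the next one in the cycle is tried; the retry cap, the timeout, and the get_with_rotation helper are assumptions for illustration, not part of the original answer:

import requests
from itertools import cycle

list_proxy = ['socks5://Username:Password@IP1:20000',
              'socks5://Username:Password@IP2:20000']

proxy_cycle = cycle(list_proxy)

def get_with_rotation(url, max_retries=5):
    # Try up to max_retries proxies from the cycle before giving up
    for _ in range(max_retries):
        proxy = next(proxy_cycle)
        proxies = {"http": proxy, "https": proxy}
        try:
            return requests.get(url, proxies=proxies, timeout=10)
        except requests.exceptions.RequestException:
            continue  # dead or blocked proxy: move on to the next one
    raise RuntimeError("all proxy attempts failed for " + url)

print(get_with_rotation('https://ident.me/').text)
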
Undulate answered 22/10, 2019 at 8:41 Comment(1)
What is the purpose of '# Prime the pump' here? – Laise

The problem with using free proxies from sites like this is:

  1. websites know about these and may block just because you're using one of them

  2. you don't know that other people haven't gotten them blacklisted by doing bad things with them

  3. the site is likely using some form of other identifier to track you across proxies based on other characteristics (device fingerprinting, proxy-piercing, etc)

Unfortunately, there's not a lot you can do other than be more sophisticated (distribute across multiple devices, use a VPN/Tor, etc.) and risk your IP being blocked for attempting DDoS-like traffic, or, preferably, see if the site has an API for access.

Assimilative answered 26/4, 2019 at 17:28 Comment(0)

Presumably you have your own pool of proxies - what is the best way to rotate them?

First, if we blindly pick a random proxy we risk repeating a connection from the same proxy multiple times in a row. On top of that, most connection-pattern-based blocking works on the proxy subnet (the 3rd number of the IP) rather than the host, so it's best to prevent repeats at the subnet level.

It's also a good idea to track proxy performance, as not all proxies are equal: we want to use our better-performing proxies more often and let dead proxies cool down.

All of this can be done with weighted randomization, which is implemented by Python's random.choices() function:

import random
from time import time
from typing import List, Literal


class Proxy:
    """container for a proxy"""

    def __init__(self, ip, type_="datacenter") -> None:
        self.ip: str = ip
        self.type: Literal["datacenter", "residential"] = type_
        _, _, self.subnet, self.host = ip.split(":")[0].split('.')
        self.status: Literal["alive", "unchecked", "dead"] = "unchecked"
        self.last_used: int = None

    def __repr__(self) -> str:
        return self.ip

    def __str__(self) -> str:
        return self.ip


class Rotator:
    """weighted random proxy rotator"""

    def __init__(self, proxies: List[Proxy]):
        self.proxies = proxies
        self._last_subnet = None

    def weigh_proxy(self, proxy: Proxy):
        weight = 1_000
        if proxy.subnet == self._last_subnet:
            weight -= 500
        if proxy.status == "dead":
            weight -= 500
        if proxy.status == "unchecked":
            weight += 250
        if proxy.type == "residential":
            weight += 250
        if proxy.last_used: 
            _seconds_since_last_use = time() - proxy.last_used
            weight += _seconds_since_last_use
        return weight

    def get(self):
        proxy_weights = [self.weigh_proxy(p) for p in self.proxies]
        proxy = random.choices(
            self.proxies,
            weights=proxy_weights,
            k=1,
        )[0]
        proxy.last_used = time()
        self._last_subnet = proxy.subnet
        return proxy

If we mock-run this Rotator, we can see how the weighted randomization distributes our connections:

from collections import Counter

if __name__ == "__main__":
    proxies = [
        # these will be used more often
        Proxy("xx.xx.121.1", "residential"),
        Proxy("xx.xx.121.2", "residential"),
        Proxy("xx.xx.121.3", "residential"),
        # these will be used less often
        Proxy("xx.xx.122.1"),
        Proxy("xx.xx.122.2"),
        Proxy("xx.xx.123.1"),
        Proxy("xx.xx.123.2"),
    ]
    rotator = Rotator(proxies)

    # let's mock some runs:
    _used = Counter()
    _failed = Counter()
    def mock_scrape():
        proxy = rotator.get()
        _used[proxy.ip] += 1
        if proxy.host == "1":  # simulate proxies with .1 being significantly worse
            _fail_rate = 60
        else:
            _fail_rate = 20
        if random.randint(0, 100) < _fail_rate:  # simulate some failure
            _failed[proxy.ip] += 1
            proxy.status = "dead"
            mock_scrape()
        else:
            proxy.status = "alive"
            return
    for i in range(10_000):
        mock_scrape()

    for proxy, count in _used.most_common():
        print(f"{proxy} was used   {count:>5} times")
        print(f"                failed {_failed[proxy]:>5} times")

# will print:
# xx.xx.121.2 was used    2629 times
#                 failed   522 times
# xx.xx.121.3 was used    2603 times
#                 failed   508 times
# xx.xx.123.2 was used    2321 times
#                 failed   471 times
# xx.xx.122.2 was used    2302 times
#                 failed   433 times
# xx.xx.121.1 was used    1941 times
#                 failed  1187 times
# xx.xx.122.1 was used    1629 times
#                 failed   937 times
# xx.xx.123.1 was used    1572 times
#                 failed   939 times

By using weighted randomization we can create a connection pattern that appears random but is smart. We can apply generic patterns, like never using proxies from the same IP family twice in a row, as well as custom per-target logic, like prioritizing North American IPs for North American targets, as sketched below.
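
As an illustration of such per-target logic, here is a small sketch extending the Rotator above with a region bonus; the region attribute, the RegionAwareRotator name, and the bonus value are assumptions for the sketch, not part of the original class:

class RegionAwareRotator(Rotator):
    """the same weighted rotator, plus a bonus for proxies near the target"""

    def __init__(self, proxies: List[Proxy], target_region: str = "NA"):
        super().__init__(proxies)
        self.target_region = target_region

    def weigh_proxy(self, proxy: Proxy):
        weight = super().weigh_proxy(proxy)
        # assumes each Proxy was given a .region attribute, e.g. "NA" or "EU"
        if getattr(proxy, "region", None) == self.target_region:
            weight += 250  # arbitrary bonus for IPs close to the target site
        return weight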

For more on this, see my blog post How to Rotate Proxies in Web Scraping.

Outbid answered 10/9, 2022 at 2:55 Comment(0)
