How To Rotate Proxies and IP Addresses using R and rvest

I'm doing some scraping, but as I'm parsing approximately 4,000 URLs, the website eventually detects my IP and blocks me roughly every 20 iterations.

I've added a bunch of Sys.sleep(5) calls and a tryCatch so I'm not blocked too soon.

I use a VPN, but I have to manually disconnect and reconnect it every now and then to change my IP. That's not a workable solution for a scraper that's supposed to run all night.

I think rotating proxies should do the job.

Here's my current code (part of it, at least):

library(rvest)
library(dplyr)

scraped_data = data.frame()

for (i in urlsuffixes$suffix) {
  
  tryCatch({
    message("Let's scrape that, Buddy !")
    
    Sys.sleep(5)
 
    doctolib_url = paste0("https://www.website.com/test/", i)

    page = read_html(doctolib_url)
    
    links = page %>%
      html_nodes(".seo-directory-doctor-link") %>%
      html_attr("href")
    
    Sys.sleep(5)
    
    name = page %>%
      html_nodes(".seo-directory-doctor-link") %>%
      html_text()
    
    Sys.sleep(5)
    
    job_title = page %>%
      html_nodes(".seo-directory-doctor-speciality") %>%
      html_text()
    
    Sys.sleep(5)
    
    address = page %>%
      html_nodes(".seo-directory-doctor-address") %>%
      html_text()
    
    Sys.sleep(5)
    
    scraped_data = rbind(scraped_data, data.frame(links,
                                                  name,
                                                  address,
                                                  job_title,
                                                  stringsAsFactors = FALSE))
    
  }, error=function(e){cat("Houston, we have a problem !","\n",conditionMessage(e),"\n")})
  print(paste("Page : ", i))
}
Broadbent answered 7/4/2021 at 12:24

Interesting question. I think the first thing to note is that, as mentioned in this GitHub issue, rvest and xml2 use httr for their connections. As such, I'm going to introduce httr into this answer.

Using a proxy with httr

The following code chunk shows how to use httr to request a URL through a proxy and extract the HTML content.

page <- httr::content(
    httr::GET(
        url, 
        httr::use_proxy(ip, port, username, password)
    )
)

If you are using IP authentication or don't need a username and password, you can simply exclude those values from the call.
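
For instance, with an IP-authenticated proxy the call might reduce to something like this. This is only a sketch: the proxy host, port, and URL below are placeholders, not values from the question.

library(httr)

# Placeholder proxy details (documentation-range IP) -- substitute your provider's values
proxy_ip   <- "203.0.113.10"
proxy_port <- 8080

page <- httr::content(
    httr::GET(
        "https://www.website.com/test/some-suffix",
        httr::use_proxy(proxy_ip, proxy_port)
    )
)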

In short, you can replace the page = read_html(doctolib_url) call with the code chunk above.

Rotating the Proxies

One big problem with using proxies is getting reliable ones. For this, I'm just going to assume that you have a reliable source. Since you haven't indicated otherwise, I'm going to assume that your proxies are stored in the following reasonable format with object name proxies:

ip              port
64.235.204.107  8080
167.71.190.253  80
185.156.172.122 3128
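
If your list isn't already in that shape, you could build the proxies object yourself. This is just one way to do it, using the example rows above:

proxies <- data.frame(
    ip   = c("64.235.204.107", "167.71.190.253", "185.156.172.122"),
    port = c(8080, 80, 3128),
    stringsAsFactors = FALSE
)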

With that format in mind, you could tweak the script chunk above to rotate proxies for every web request as follows:

library(dplyr)
library(httr)
library(rvest)

scraped_data = data.frame()

for (i in seq_along(urlsuffixes$suffix)) {
  
  tryCatch({
    message("Let's scrape that, Buddy !")
    
    Sys.sleep(5)
 
    doctolib_url = paste0("https://www.website.com/test/", 
                          urlsuffixes$suffix[[i]])
   
    # There are more urls than proxies, so cycle through the proxy list;
    # when i is an exact multiple of nrow(proxies), fall back to the last row
    proxy_id <- ifelse(i %% nrow(proxies) == 0, nrow(proxies), i %% nrow(proxies))

    page <- httr::content(
        httr::GET(
            doctolib_url, 
            httr::use_proxy(proxies$ip[[proxy_id]], proxies$port[[proxy_id]])
        )
    )
    
    links = page %>%
      html_nodes(".seo-directory-doctor-link") %>%
      html_attr("href")
    
    Sys.sleep(5)
    
    name = page %>%
      html_nodes(".seo-directory-doctor-link") %>%
      html_text()
    
    Sys.sleep(5)
    
    job_title = page %>%
      html_nodes(".seo-directory-doctor-speciality") %>%
      html_text()
    
    Sys.sleep(5)
    
    address = page %>%
      html_nodes(".seo-directory-doctor-address") %>%
      html_text()
    
    Sys.sleep(5)
    
    scraped_data = rbind(scraped_data, data.frame(links,
                                                  name,
                                                  address,
                                                  job_title,
                                                  stringsAsFactors = FALSE))
    
  }, error=function(e){cat("Houston, we have a problem !","\n",conditionMessage(e),"\n")})
  print(paste("Page : ", i))
}
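
If the loop gets hard to read, one option is to pull the proxy rotation and the request into a small helper. This is only a sketch: fetch_with_proxy is a name I made up, not part of any package, and it uses the same modulo trick as above.

fetch_with_proxy <- function(url, i, proxies) {
    # Cycle through the proxy table: rows 1..nrow(proxies), then wrap around
    proxy_id <- (i - 1) %% nrow(proxies) + 1
    httr::content(
        httr::GET(
            url,
            httr::use_proxy(proxies$ip[[proxy_id]], proxies$port[[proxy_id]])
        )
    )
}

# Inside the loop, this would replace the httr::content(...) block:
# page <- fetch_with_proxy(doctolib_url, i, proxies)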

This may not be enough

You might want to go a few steps further and add elements to the httr request, such as a user agent. However, one of the big problems with a package like httr is that it can't render dynamic HTML content such as JavaScript-rendered HTML, and any website that seriously cares about blocking scrapers will detect this. To get around that, there are tools such as headless Chrome that are designed for exactly this kind of problem. Here's a package you might want to look into for headless Chrome in R (note: it's still in development).
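
For example, sending a browser-like user agent alongside the proxy might look like the following; the user-agent string and the 30-second timeout are illustrative choices, not values from the question.

page <- httr::content(
    httr::GET(
        doctolib_url,
        httr::use_proxy(proxies$ip[[proxy_id]], proxies$port[[proxy_id]]),
        httr::user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"),
        httr::timeout(30)
    )
)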

Disclaimer

Obviously, I think this code will work, but since there's no reproducible data to test with, it may not.

Lunde answered 7/4/2021 at 15:25

As @Daniel-Molitor already said, headless Chrome gives stunning results. Another cheap option in RStudio is to loop over a list of proxies, although you have to start a new R process after each change:

Sys.setenv(http_proxy = proxy)
.rs.restartR()

Sys.sleep(1) can even be omitted afterwards ;-)
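
Spelled out a little, that pattern could look like the sketch below. It assumes the same proxies data frame as in the other answer, adds https_proxy as well (my assumption for HTTPS sites), and .rs.restartR() only works inside RStudio.

# Pick one proxy per run, e.g. the first row of the proxies table
proxy <- paste0(proxies$ip[[1]], ":", proxies$port[[1]])

# Route HTTP and HTTPS traffic through it via environment variables
Sys.setenv(http_proxy = proxy, https_proxy = proxy)

# ... scrape a batch of URLs here ...

# Restart the R session (RStudio only) before switching to the next proxy
.rs.restartR()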

Ducky answered 20/7/2022 at 09:01
