I'm doing some scraping, but as I'm parsing approximately 4000 URL's, the website eventually detects my IP and blocks me every 20 iterations.
I've written a bunch of Sys.sleep(5)
and a tryCatch
so I'm not blocked too soon.
I use a VPN but I have to manually disconnect and reconnect it every now and then to change my IP. That's not a suitable solution with such a scraper supposed to run all night long.
I think rotating a proxy should do the job.
Here's my current code (a part of it at least) :
library(rvest)
library(dplyr)
scraped_data = data.frame()
for (i in urlsuffixes$suffix)
{
tryCatch({
message("Let's scrape that, Buddy !")
Sys.sleep(5)
doctolib_url = paste0("https://www.website.com/test/", i)
page = read_html(site_url)
links = page %>%
html_nodes(".seo-directory-doctor-link") %>%
html_attr("href")
Sys.sleep(5)
name = page %>%
html_nodes(".seo-directory-doctor-link") %>%
html_text()
Sys.sleep(5)
job_title = page %>%
html_nodes(".seo-directory-doctor-speciality") %>%
html_text()
Sys.sleep(5)
address = page %>%
html_nodes(".seo-directory-doctor-address") %>%
html_text()
Sys.sleep(5)
scraped_data = rbind(scraped_data, data.frame(links,
name,
address,
job_title,
stringsAsFactors = FALSE))
}, error=function(e){cat("Houston, we have a problem !","\n",conditionMessage(e),"\n")})
print(paste("Page : ", i))
}