The following is a script to reproduce the problems I'm facing when building a crawler with RCurl that performs concurrent requests. The objective is to download the content of several thousand websites in order to perform statistical analysis, so the solution needs to scale.
library(RCurl)
library(httr)
uris = c("inforapido.com.ar", "lm.facebook.com", "promoswap.enterfactory.com",
"p.brilig.com", "wap.renxo.com", "alamaula.com", "syndication.exoclick.com",
"mcp-latam.zed.com", "startappexchange.com", "fonts.googleapis.com",
"xnxx.com", "wv.inner-active.mobi", "canchallena.lanacion.com.ar",
"android.ole.com.ar", "livefyre.com", "fbapp://256002347743983/thread")
### RCurl Concurrent requests
getURIs <- function(uris, ..., multiHandle = getCurlMultiHandle(), .perform = TRUE){
  content = list()
  curls = list()
  # One curl handle and one text gatherer per URI, all pushed onto the
  # same multi handle so the requests are performed concurrently
  for(i in uris) {
    curl = getCurlHandle()
    content[[i]] = basicTextGatherer()
    opts = curlOptions(URL = i, writefunction = content[[i]]$update,
                       timeout = 2, maxredirs = 3, verbose = TRUE,
                       followLocation = TRUE, ...)
    curlSetOpt(.opts = opts, curl = curl)
    multiHandle = push(multiHandle, curl)
  }
  if(.perform) {
    # Drive all transfers to completion, then return the gathered text
    complete(multiHandle)
    lapply(content, function(x) x$value())
  } else {
    return(list(multiHandle = multiHandle, content = content))
  }
}
### Split uris into 3 groups (note: split() recycles 1:3 and warns when length(uris) is not a multiple of 3)
uris_ls = split(uris, 1:3)
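For a larger URI set it may be more convenient to split by a fixed chunk size rather than into a fixed number of groups. A minimal sketch (the chunk size of 50 is an arbitrary choice, not something from the original script):
chunk_size = 50
uris_ls = split(uris, ceiling(seq_along(uris) / chunk_size))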
### retrieve content
uris_content <- list()
for(i in seq_along(uris_ls)){
  uris_content[[i]] <- getURIs(uris_ls[[i]])
}
library(plyr)
a = lapply(uris_content, function(x) ldply(x, rbind))
result = ldply(a, rbind)
names(result) <- c('url', 'content')
result$number_char <- nchar(as.character(result$content))
### Here are examples of URLs that aren't working (note: fbapp://... uses a non-HTTP scheme, so no HTTP client can fetch it)
url_not_working = result[result$number_char == 0, 1]
# url_not_working
# [1] "inforapido.com.ar" "canchallena.lanacion.com.ar" "fbapp://256002347743983/thread"
# [4] "xnxx.com" "startappexchange.com" "wv.inner-active.mobi"
# [7] "livefyre.com"
### Using httr's GET() it works fine
get_httr = GET(url_not_working[2])
content(get_httr, 'text')
# The failure also reproduces with a single getURL() call
get_rcurl = getURL(url_not_working[2], encoding = 'UTF-8', timeout = 2,
                   maxredirs = 3, verbose = TRUE,
                   followLocation = TRUE)
get_rcurl
Question:
Given the number of web pages I need to crawl, I would rather use RCurl, as it supports concurrent requests. I wonder if it is possible to improve the getURIs() call so that it works like the GET() version in the cases where the getURL()/getURIs() versions fail.
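One difference worth ruling out: httr's GET() sends default request headers (including a User-Agent), while RCurl sets no User-Agent by default, and some servers return empty or error responses to clients that do not identify themselves. A minimal sketch, assuming the option can simply be forwarded through getURIs()'s ... argument (the user-agent string itself is an arbitrary choice):
# Hypothetical: pass a user agent through to curlOptions() via ...
ua = "Mozilla/5.0 (compatible; R-crawler)"
retry_content = getURIs(as.character(url_not_working), useragent = ua)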
UPDATE:
I've added a gist with more data (990 URIs) to make the problem easier to reproduce.
uris_ls <- dput() # placeholder: paste in the dput() output found here: https://gist.github.com/martinbel/b4cc730b32914475ef0b
After running:
uris_content <- list()
for(i in seq_along(uris_ls)){
  uris_content[[i]] <- getURIs(uris_ls[[i]])
}
I get the following error:
Error in curlMultiPerform(obj) : embedded nul in string: 'GIF89a\001'
In addition: Warning message:
In strsplit(str, "\\\r\\\n") : input string 1 is invalid in this locale
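The 'GIF89a' prefix is the signature of a GIF image, so at least one URI is returning binary data that basicTextGatherer() cannot store in an R character string. One hedged workaround is to run each chunk under tryCatch() and fall back to one-at-a-time requests, so a single binary response does not abort the whole batch (safe_getURIs is a hypothetical wrapper, not part of RCurl):
safe_getURIs <- function(uris, ...) {
  tryCatch(getURIs(uris, ...),
           error = function(e) {
             # Retry one URI at a time; mark responses that still
             # cannot be represented as text with NA
             lapply(uris, function(u)
               tryCatch(getURL(u, timeout = 2, maxredirs = 3,
                               followLocation = TRUE),
                        error = function(e) NA_character_))
           })
}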
Using getURIAsynchronous:
uris_content <- list()
for(i in seq_along(uris_ls)){
  uris_content[[i]] <- getURIAsynchronous(uris_ls[[i]],
    .opts = list(timeout = 2, maxredirs = 3, verbose = TRUE,
                 followLocation = TRUE))
}
I get a similar error: Error in nchar(str) : invalid multibyte string 1
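That error usually means a response contains bytes that are invalid in the session's locale encoding. A hedged option is to sanitize each downloaded string before measuring or parsing it; base R's iconv() with sub = "byte" substitutes the offending bytes instead of failing:
# Convert from the current locale encoding to UTF-8, escaping invalid bytes
sanitize <- function(x) iconv(x, from = "", to = "UTF-8", sub = "byte")
nc_safe = lapply(uris_content, function(x) nchar(sanitize(x)))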
UPDATE 2
library(RCurl)
uris_ls <- dput() # placeholder: paste in the dput() output found here: https://gist.github.com/martinbel/b4cc730b32914475ef0b
After trying the following:
Sys.setlocale(locale="C")
uris_content <- list()
for(i in seq_along(uris_ls)){
  uris_content[[i]] <- getURIAsynchronous(uris_ls[[i]],
    .opts = list(timeout = 2, maxredirs = 3, verbose = TRUE,
                 followLocation = TRUE))
}
The result is that it works well for the first 225 URLs, after which every site comes back with zero content. Is this the embedded-nul issue again? (A retry sketch follows the inspection output below.)
# This is a quick way to inspect the output:
nc = lapply(uris_content, nchar)
nc[[5]]
[1] 51422 0 16 19165 111763 6 14041 202 2485 0
[11] 78538 0 0 0 133253 42978 0 0 7880 33336
[21] 6762 194 93 0 0 0 0 0 9 0
[31] 165974 13222 22605 1392 0 42932 1421 0 0 0
[41] 0 13760 289 0 2674
nc[[6]]
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[39] 0 0 0 0 0 0 0
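If the later chunks come back empty rather than erroring, one hedged follow-up is to find the empty responses and re-request only those URIs. A sketch, assuming uris_ls and uris_content line up element-for-element (the longer retry timeout is an arbitrary choice):
for(i in seq_along(uris_ls)){
  empty = nchar(uris_content[[i]]) == 0
  if(any(empty)) {
    # Retry only the URIs that returned no content
    uris_content[[i]][empty] <- getURIAsynchronous(uris_ls[[i]][empty],
      .opts = list(timeout = 5, maxredirs = 3, followLocation = TRUE))
  }
}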