Issues with an RCurl crawler that performs concurrent requests

The following is a script to reproduce the problems I'm facing when building a crawler with RCurl that performs concurrent requests. The objective is to download the content of several thousand websites in order to perform statistical analysis, so the solution needs to scale.

library(RCurl)
library(httr)

uris = c("inforapido.com.ar", "lm.facebook.com", "promoswap.enterfactory.com", 
         "p.brilig.com", "wap.renxo.com", "alamaula.com", "syndication.exoclick.com", 
         "mcp-latam.zed.com", "startappexchange.com", "fonts.googleapis.com", 
         "xnxx.com", "wv.inner-active.mobi", "canchallena.lanacion.com.ar", 
         "android.ole.com.ar", "livefyre.com", "fbapp://256002347743983/thread")

### RCurl Concurrent requests 

getURIs <- function(uris, ..., multiHandle = getCurlMultiHandle(), .perform = TRUE){
  content = list()
  curls = list()
  for(i in uris) {
    # one easy handle and one text gatherer per URI, all pushed onto the multi handle
    curl = getCurlHandle()
    content[[i]] = basicTextGatherer()
    opts = curlOptions(URL = i, writefunction = content[[i]]$update,
                       timeout = 2, maxredirs = 3, verbose = TRUE,
                       followLocation = TRUE, ...)
    curlSetOpt(.opts = opts, curl = curl)
    multiHandle = push(multiHandle, curl)
  }
  if(.perform) {
    # perform all queued requests concurrently and return the collected bodies
    complete(multiHandle)
    lapply(content, function(x) x$value())
  } else {
    return(list(multiHandle = multiHandle, content = content))
  }
}

### Split uris in 3
uris_ls = split(uris, 1:3)

### retrieve content 
uris_content <- list()
for(i in seq_along(uris_ls)){
  uris_content[[i]] <- getURIs(uris_ls[[i]])
}

library(plyr)
a = lapply(uris_content, function(x) ldply(x, rbind))
result = ldply(a, rbind)
names(result) <- c('url', 'content')
result$number_char <- nchar(as.character(result$content))

### Here are examples of url that aren't working
url_not_working = result[result$number_char == 0, 1]

# url_not_working
# [1] "inforapido.com.ar"              "canchallena.lanacion.com.ar"    "fbapp://256002347743983/thread"
# [4] "xnxx.com"                       "startappexchange.com"           "wv.inner-active.mobi"          
# [7] "livefyre.com"   

### Using httr GET it works fine

get_httr = GET(url_not_working[2])
content(get_httr, 'text')

# A single getURL() call gives the same (empty) result
get_rcurl = getURL(url_not_working[2], encoding='UTF-8', timeout = 2, 
                   maxredirs = 3, verbose = TRUE,
                   followLocation = TRUE)
get_rcurl

Question:

Given the number of web pages I need to crawl, I would rather use RCurl for this, as it supports concurrent requests. I wonder if it is possible to improve the getURIs() call so that it works like the GET() version in the cases where the getURL()/getURIs() version fails.
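
One direction I've considered (a sketch only, not verified): httr sends a default User-Agent header while my bare RCurl handles do not, so some of the failing hosts may simply be rejecting anonymous requests. Since getURIs() forwards ... to curlOptions(), extra options can be passed straight through:

# Untested sketch: retry the failing URLs with a User-Agent set and SSL
# peer verification relaxed, reusing the getURIs() defined above.
ua <- "Mozilla/5.0 (compatible; R RCurl crawler)"
retry <- getURIs(as.character(url_not_working),
                 useragent      = ua,     # httr sets one by default
                 ssl.verifypeer = FALSE)  # in case a redirect lands on https
sapply(retry, nchar)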

UPDATE:

I've added a gist with more data (990 uris) to better reproduce the problem.

uris_ls <- dput() # dput() output found here: https://gist.github.com/martinbel/b4cc730b32914475ef0b

After running:

uris_content <- list()
for(i in seq_along(uris_ls)){
  uris_content[[i]] <- getURIs(uris_ls[[i]])
}

I get the following error:

Error in curlMultiPerform(obj) : embedded nul in string: 'GIF89a\001'
In addition: Warning message:
In strsplit(str, "\\\r\\\n") : input string 1 is invalid in this locale

Using getURIAsynchronous:

  uris_content <- list()
  for(i in seq_along(uris_ls)){
    uris_content[[i]] <- getURIAsynchronous(uris_ls[[i]], 
         .opts=list(timeout = 2, maxredirs = 3, verbose = TRUE,
         followLocation = TRUE))
  }

I get a similar error: Error in nchar(str) : invalid multibyte string 1
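
My working assumption is that both errors come from binary responses (e.g. the GIF89a payload) being collected as text. A possible workaround I haven't fully tested: fetch each URI with getURLContent(), which inspects the Content-Type header and returns a raw vector for binary bodies, and wrap it in tryCatch() so one bad URI doesn't abort the whole chunk. This gives up the concurrency, so it's mainly a way to confirm where the error comes from:

# Sketch: sequential, error-tolerant fetch that discards binary bodies
safe_get <- function(u, .opts = list(timeout = 2, maxredirs = 3,
                                     followLocation = TRUE)) {
  tryCatch({
    x <- getURLContent(u, .opts = .opts)
    if (is.raw(x)) "" else x   # drop binary content, keep text
  }, error = function(e) "")   # empty string on timeouts / bad hosts
}

uris_content <- lapply(uris_ls, function(chunk) sapply(chunk, safe_get))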

UPDATE 2

library(RCurl)
uris_ls <- dput() # dput() output found here: https://gist.github.com/martinbel/b4cc730b32914475ef0b

After trying the following:

Sys.setlocale(locale="C")
uris_content <- list()
for(i in seq_along(uris_ls)){
    uris_content[[i]] <- getURIAsynchronous(uris_ls[[i]], 
       .opts=list(timeout = 2, maxredirs = 3, verbose = TRUE,
       followLocation = TRUE))
}

The result is that it works well for the first 225 URLs, but after that it just returns zero content from the websites. Could this be the embedded-nul issue again?

# This is a quick way to inspect the output:
nc = lapply(uris_content, nchar)
nc[[5]]
 [1]  51422      0     16  19165 111763      6  14041    202   2485      0
[11]  78538      0      0      0 133253  42978      0      0   7880  33336
[21]   6762    194     93      0      0      0      0      0      9      0
[31] 165974  13222  22605   1392      0  42932   1421      0      0      0
[41]      0  13760    289      0   2674

nc[[6]]
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[39] 0 0 0 0 0 0 0
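
To narrow this down, I'm planning the following check (a sketch built on an assumption: that the zero-length results come from the shared connection or handle state going bad after a few hundred requests, rather than from the servers themselves). Re-fetching the empty ones one at a time, where getURL() builds a fresh handle for every call, should tell the two cases apart:

# Sketch: retry only the URIs that came back empty, one request at a time
retry_empty <- function(chunk, got) {
  empty <- which(nchar(got) == 0)
  for (j in empty) {
    got[j] <- tryCatch(
      getURL(chunk[j], timeout = 2, maxredirs = 3, followLocation = TRUE),
      error = function(e) "")
  }
  got
}

uris_content[[6]] <- retry_empty(uris_ls[[6]], uris_content[[6]])
nchar(uris_content[[6]])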
Endopeptidase answered 28/9, 2014 at 22:29

Comments (5):

I am not sure, but maybe the answer here can help. – Indelicate
It seems the results are similar using getURIs and getURIAsynchronous; the latter is probably the better version. I'll add a dput() of my data so it's fully reproducible. – Endopeptidase
After running a few more tests I found that both versions are crashing, for different reasons. With the dput() dataset it's possible to reproduce the error that only appears after processing a few hundred URLs. – Endopeptidase
Personally, I would not use R for this. For hundreds of URLs you should use a specialized scraping framework; I have used Scrapy (scrapy.org). It is in Python, but really easy to use and learn. – Indelicate
Take a look at #25090725 for a possible solution to the "nul" problem. For the locale issue, you can do Sys.setlocale(locale="C"). – Knudson

Since nobody has answered, I'll propose a temporary solution: if getURIAsynchronous doesn't work, fall back to downloading sequentially with httr::GET and httr::content, which don't have the embedded-nul issue.

library(RCurl)
library(httr)

Sys.setlocale(locale="C")

opts = list(timeout = 2, maxredirs = 3, 
            verbose = TRUE, followLocation = TRUE)

try_asynch <- function(uris, .opts = opts){
  getURIAsynchronous(uris, .opts = .opts)
}

get_content <- function(uris){
  cont <- try_asynch(uris)
  # flag which URIs in this chunk came back with any content
  nc <- lapply(cont, nchar)
  nc <- sapply(nc, function(x) ifelse(sum(x > 0), 1, 0))
  # if fewer than 10 URIs returned content, assume the async call failed
  # and refetch the whole chunk sequentially with httr
  if(sum(nc) < 10){
    r <- lapply(uris, function(x) GET(x))
    cont <- lapply(r, function(x) content(x, 'text'))
  }
  cont
}

docs <- lapply(uris_ls, get_content)
Endopeptidase answered 9/10, 2014 at 7:30
