I am querying Freebase to get the genre information for some 10000 movies.
After reading How to optimise scraping with getURL() in R, I tried to execute the requests in parallel. However, I failed (see below). Besides parallelization, I have also read that httr might be a better alternative to RCurl.
My questions are:
Is it possible to speed up the API calls by using
a) a parallel version of the loop below (on a Windows machine)?
b) alternatives to getURL, such as GET from the httr package?
library(RCurl)
library(jsonlite)
library(foreach)
library(doSNOW)
df <- data.frame(film = c("Terminator", "Die Hard", "Philadelphia", "A Perfect World",
                          "The Parade", "ParaNorman", "Passengers", "Pink Cadillac",
                          "Pleasantville", "Police Academy", "The Polar Express", "Platoon"),
                 genre = NA)
f_query_freebase <- function(film.title){
  request <- paste0("https://www.googleapis.com/freebase/v1/search?",
                    "filter=", paste0("(all alias{full}:", "\"", film.title, "\"", " type:\"/film/film\")"),
                    "&indent=TRUE",
                    "&limit=1",
                    "&output=(/film/film/genre)")
  temp <- getURL(URLencode(request), ssl.verifypeer = FALSE)
  data <- fromJSON(temp, simplifyVector = FALSE)
  genre <- paste(sapply(data$result[[1]]$output$`/film/film/genre`[[1]],
                        function(x){as.character(x$name)}), collapse = " | ")
  return(genre)
}
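For question b), this is roughly what I have in mind with httr (untested sketch; the name f_query_freebase_httr is just for illustration): GET assembles and URL-encodes the query parameters itself, and content() returns the response body as text.
# httr variant (untested sketch)
# ----------------------------------
library(httr)
f_query_freebase_httr <- function(film.title){
  resp <- GET("https://www.googleapis.com/freebase/v1/search",
              query = list(filter = paste0("(all alias{full}:\"", film.title, "\" type:\"/film/film\")"),
                           limit  = 1,
                           output = "(/film/film/genre)"))
  data <- fromJSON(content(resp, as = "text", encoding = "UTF-8"), simplifyVector = FALSE)
  paste(sapply(data$result[[1]]$output$`/film/film/genre`[[1]],
               function(x){as.character(x$name)}), collapse = " | ")
}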
# Non-parallel version
# ----------------------------------
for (i in df$film){
  df$genre[which(df$film==i)] <- f_query_freebase(i)
}
# Parallel version - Does not work
# ----------------------------------
# Set up parallel computing
cl <- makeCluster(2)
registerDoSNOW(cl)
foreach(i = df$film) %dopar% {
  df$genre[which(df$film==i)] <- f_query_freebase(i)
}
stopCluster(cl)
# --> This fails with the error "Error in { : task 1 failed", followed by a message that the function "getURL" could not be found.
Pass the .packages=c("RCurl", "jsonlite") option to foreach so those packages are loaded by the workers. – Epeirogeny
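Following up on that comment, a revised parallel version might look like the sketch below (untested). It passes .packages so that RCurl and jsonlite are loaded on each worker, and it collects the results via .combine instead of assigning into df inside the loop, since each worker only sees its own copy of df:
# Parallel version, revised (untested sketch)
# ----------------------------------
cl <- makeCluster(2)
registerDoSNOW(cl)
genres <- foreach(i = df$film, .combine = "c",
                  .packages = c("RCurl", "jsonlite")) %dopar% {
  # if f_query_freebase is not found on the workers, add .export = "f_query_freebase"
  f_query_freebase(i)
}
stopCluster(cl)
df$genre <- genres  # foreach returns the results in input order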