Speed up API calls in R
I am querying Freebase to get the genre information for some 10,000 movies.

After reading How to optimise scraping with getURL() in R, I tried to execute the requests in parallel. However, I failed (see below). Besides parallelization, I also read that httr might be a better alternative to RCurl.

My questions are: Is it possible to speed up the API calls by using (a) a parallel version of the loop below (on a Windows machine), or (b) alternatives to getURL(), such as GET() in the httr package?

library(RCurl)
library(jsonlite)
library(foreach)
library(doSNOW)

df <- data.frame(film=c("Terminator", "Die Hard", "Philadelphia", "A Perfect World", "The Parade", "ParaNorman", "Passengers", "Pink Cadillac", "Pleasantville", "Police Academy", "The Polar Express", "Platoon"), genre=NA)

f_query_freebase <- function(film.title){

  request <- paste0("https://www.googleapis.com/freebase/v1/search?",
                    "filter=", paste0("(all alias{full}:", "\"", film.title, "\"", " type:\"/film/film\")"),
                    "&indent=TRUE",
                    "&limit=1",
                    "&output=(/film/film/genre)")

  temp <- getURL(URLencode(request), ssl.verifypeer = FALSE)
  data <- fromJSON(temp, simplifyVector=FALSE)
  genre <- paste(sapply(data$result[[1]]$output$`/film/film/genre`[[1]], function(x){as.character(x$name)}), collapse=" | ")
  return(genre)
}


# Non-parallel version
# ----------------------------------

for (i in df$film){
  df$genre[which(df$film==i)] <- f_query_freebase(i)      
}


# Parallel version - Does not work
# ----------------------------------

# Set up parallel computing
cl <- makeCluster(2)
registerDoSNOW(cl)

foreach(i=df$film) %dopar% {
  df$genre[which(df$film==i)] <- f_query_freebase(i)     
}

stopCluster(cl)

# --> I get the following error: "Error in { : task 1 failed", which goes on to say that it cannot find the function "getURL".
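
Regarding (b), a minimal sketch of the same call with httr's GET() instead of getURL() (an editorial illustration, untested; it assumes the same request string built inside f_query_freebase above):

library(httr)
library(jsonlite)

# GET() performs the request; content() extracts the response body as text
temp <- content(GET(URLencode(request)), as = "text", encoding = "UTF-8")
data <- fromJSON(temp, simplifyVector = FALSE)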
Saltatorial answered 10/4, 2014 at 11:27 Comment(2)
Multi-core is unlikely to speed up web requests. Read https://mcmap.net/q/1473229/-fast-url-query-with-r/… to use connection pipelining. But be aware that you're hammering someone else's server, so be polite. – Ernestineernesto
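
As an illustration of issuing the requests concurrently from one R session (an editorial sketch, not from the original post; RCurl's getURIAsynchronous() fetches a vector of URLs over one multi-handle):

library(RCurl)

# Build all request URLs up front, using the same scheme as f_query_freebase()
urls <- sapply(df$film, function(x) URLencode(paste0(
  "https://www.googleapis.com/freebase/v1/search?",
  "filter=(all alias{full}:\"", x, "\" type:\"/film/film\")",
  "&limit=1&output=(/film/film/genre)")))

# Issue the requests concurrently and collect the response bodies
responses <- getURIAsynchronous(urls, .opts = list(ssl.verifypeer = FALSE))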
To get the foreach version to work, it looks like you need to add the .packages=c("RCurl", "jsonlite") option to foreach so those packages are loaded by the workers. – Epeirogeny
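
In code, that fix looks roughly like this (a sketch: .packages loads the packages on each worker, and .combine collects the returned genres, since each worker only ever modifies its own copy of df):

library(foreach)
library(doSNOW)

cl <- makeCluster(2)
registerDoSNOW(cl)

# Results come back in the order of df$film
df$genre <- foreach(i = df$film, .combine = c,
                    .packages = c("RCurl", "jsonlite")) %dopar% {
  f_query_freebase(i)
}

stopCluster(cl)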
This doesn't achieve parallel requests within a single R session; however, it's something I've used to make more than one request at a time (i.e. in parallel) across multiple R sessions, so it may be useful.

At a high level

You'll want to break the process into a few parts:

  1. Get a list of the URLs/API calls you need to make and store them as a csv/text file (see the sketch below this list)
  2. Use the code below as a template for starting multiple R processes and dividing the work among them
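
For step 1, a minimal sketch (an assumption, not the answer's code: `urls` is a character vector of request URLs, built e.g. as in the question):

# Store the request URLs in a single-column csv where each worker can read them
write.csv(data.frame(url = urls), "api_calls.csv", row.names = FALSE)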

Note: this happened to run on Windows, so I used PowerShell. On a Mac this could be written in Bash.

PowerShell/Bash script

Use a single PowerShell script to start multiple R processes (here we divide the work between 3 processes):

e.g. save a plain text file with a .ps1 file extension; you can double-click it to run it, or schedule it with Task Scheduler/cron:

start powershell { cd C:\Users\Administrator\Desktop; Rscript extract.R 1; TIMEOUT 20000 }
start powershell { cd C:\Users\Administrator\Desktop; Rscript extract.R 2; TIMEOUT 20000 }
start powershell { cd C:\Users\Administrator\Desktop; Rscript extract.R 3; TIMEOUT 20000 }

What's it doing? It will:

  • Go to the Desktop, start the script it finds there called extract.R, and pass an argument to the R script (1, 2, and 3 respectively).

The R processes

Each R process can look like this:

# Get command line argument 
arguments <- commandArgs(trailingOnly = TRUE)
process_number <- as.numeric(arguments[1])

api_calls <- read.csv("api_calls.csv")

# Work out which API calls this R process should make:
# every 3rd row, starting at this process's number
indices <- seq(process_number, nrow(api_calls), 3)

api_calls_for_this_process_only <- api_calls[indices, ] # this subsets 1/3 of the API calls
# (the other two processes will take care of the remaining calls)

# Now, make API calls as usual using rvest/jsonlite or whatever you use for that
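
A sketch of that final step (not part of the original answer; it assumes api_calls.csv holds a single column of request URLs, so the subset above is a character vector, and that each URL returns JSON):

library(jsonlite)

# Fetch this process's share of the URLs, tolerating individual failures
results <- lapply(api_calls_for_this_process_only, function(u) {
  tryCatch(fromJSON(u), error = function(e) NULL)
})

# One output file per process, so the three processes don't overwrite each other
saveRDS(results, paste0("results_", process_number, ".rds"))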
Hans answered 19/2, 2021 at 17:54 Comment(0)
