Speed up API calls in R
I am querying Freebase to get the genre information for some 10,000 movies.

After reading How to optimise scraping with getURL() in R, I tried to execute the requests in parallel. However, I failed (see below). Besides parallelization, I also read that httr might be a better alternative to RCurl.

My questions are: Is it possible to speed up the API calls by using (a) a parallel version of the loop below (on a Windows machine), or (b) alternatives to getURL(), such as GET() in the httr package?

library(RCurl)
library(jsonlite)
library(foreach)
library(doSNOW)

df <- data.frame(film=c("Terminator", "Die Hard", "Philadelphia", "A Perfect World", "The Parade", "ParaNorman", "Passengers", "Pink Cadillac", "Pleasantville", "Police Academy", "The Polar Express", "Platoon"), genre=NA)

f_query_freebase <- function(film.title){

  request <- paste0("https://www.googleapis.com/freebase/v1/search?",
                    "filter=", paste0("(all alias{full}:", "\"", film.title, "\"", " type:\"/film/film\")"),
                    "&indent=TRUE",
                    "&limit=1",
                    "&output=(/film/film/genre)")

  temp <- getURL(URLencode(request), ssl.verifypeer = FALSE)
  data <- fromJSON(temp, simplifyVector=FALSE)
  genre <- paste(sapply(data$result[[1]]$output$`/film/film/genre`[[1]], function(x){as.character(x$name)}), collapse=" | ")
  return(genre)
}


# Non-parallel version
# ----------------------------------

for (i in df$film){
  df$genre[which(df$film==i)] <- f_query_freebase(i)      
}


# Parallel version - Does not work
# ----------------------------------

# Set up parallel computing
cl <- makeCluster(2)
registerDoSNOW(cl)

foreach(i=df$film) %dopar% {
  df$genre[which(df$film==i)] <- f_query_freebase(i)     
}

stopCluster(cl)

# --> I get the following error: "Error in { : task 1 failed", which goes on to say that it cannot find the function "getURL".
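
Regarding (b), a minimal sketch of the same call with httr's GET() instead of getURL() (an editorial illustration, untested; it assumes the same request string built inside f_query_freebase above):

library(httr)
library(jsonlite)

# GET() performs the request; content() extracts the response body as text
temp <- content(GET(URLencode(request)), as = "text", encoding = "UTF-8")
data <- fromJSON(temp, simplifyVector = FALSE)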
Saltatorial answered 10/4, 2014 at 11:27 Comment(2)
Multi-core is unlikely to speed up web requests. Read https://mcmap.net/q/1473229/-fast-url-query-with-r/… to use connection pipelining. But be aware that you're hammering someone else's server, so be polite. – Ernestineernesto
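
As an illustration of issuing the requests concurrently from one R session (an editorial sketch, not from the original post; RCurl's getURIAsynchronous() fetches a vector of URLs over one multi-handle):

library(RCurl)

# Build all request URLs up front, using the same scheme as f_query_freebase()
urls <- sapply(df$film, function(x) URLencode(paste0(
  "https://www.googleapis.com/freebase/v1/search?",
  "filter=(all alias{full}:\"", x, "\" type:\"/film/film\")",
  "&limit=1&output=(/film/film/genre)")))

# Issue the requests concurrently and collect the response bodies
responses <- getURIAsynchronous(urls, .opts = list(ssl.verifypeer = FALSE))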
To get the foreach version to work, it looks like you need to add the .packages=c("RCurl", "jsonlite") option to foreach so those packages are loaded by the workers. – Epeirogeny
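
In code, that fix looks roughly like this (a sketch: .packages loads the packages on each worker, and .combine collects the returned genres, since each worker only ever modifies its own copy of df):

library(foreach)
library(doSNOW)

cl <- makeCluster(2)
registerDoSNOW(cl)

# Results come back in the order of df$film
df$genre <- foreach(i = df$film, .combine = c,
                    .packages = c("RCurl", "jsonlite")) %dopar% {
  f_query_freebase(i)
}

stopCluster(cl)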
This doesn't achieve parallel requests within a single R session; however, it's something I've used to make more than one request at a time (i.e. in parallel) across multiple R sessions, so it may be useful.

At a high level

You'll want to break the process into a few parts:

  1. Get a list of the URLs/API calls you need to make and store them as a csv/text file (see the sketch below this list)
  2. Use the code below as a template for starting multiple R processes and dividing the work among them
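
For step 1, a minimal sketch (an assumption, not the answer's code: `urls` is a character vector of request URLs, built e.g. as in the question):

# Store the request URLs in a single-column csv where each worker can read them
write.csv(data.frame(url = urls), "api_calls.csv", row.names = FALSE)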

Note: this happened to run on Windows, so I used PowerShell. On a Mac this could be written in Bash.

PowerShell/Bash script

Use a single PowerShell script to start multiple R processes (here we divide the work between 3 processes):

e.g. save a plain text file with a .ps1 file extension; you can double-click it to run it, or schedule it with Task Scheduler/cron:

start powershell { cd C:\Users\Administrator\Desktop; Rscript extract.R 1; TIMEOUT 20000 }
start powershell { cd C:\Users\Administrator\Desktop; Rscript extract.R 2; TIMEOUT 20000 }
start powershell { cd C:\Users\Administrator\Desktop; Rscript extract.R 3; TIMEOUT 20000 }

What's it doing? It will:

  • Go to the Desktop, start the script it finds there called extract.R, and pass an argument to the R script (1, 2, and 3 respectively).

The R processes

Each R process can look like this:

# Get command line argument 
arguments <- commandArgs(trailingOnly = TRUE)
process_number <- as.numeric(arguments[1])

api_calls <- read.csv("api_calls.csv")

# Work out which API calls this R process should make:
# every 3rd row, starting at this process's number
indices <- seq(process_number, nrow(api_calls), 3)

api_calls_for_this_process_only <- api_calls[indices, ] # this subsets 1/3 of the API calls
# (the other two processes will take care of the remaining calls)

# Now, make API calls as usual using rvest/jsonlite or whatever you use for that
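
A sketch of that final step (not part of the original answer; it assumes api_calls.csv holds a single column of request URLs, so the subset above is a character vector, and that each URL returns JSON):

library(jsonlite)

# Fetch this process's share of the URLs, tolerating individual failures
results <- lapply(api_calls_for_this_process_only, function(u) {
  tryCatch(fromJSON(u), error = function(e) NULL)
})

# One output file per process, so the three processes don't overwrite each other
saveRDS(results, paste0("results_", process_number, ".rds"))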
Hans answered 19/2, 2021 at 17:54 Comment(0)
