Run Selenium tests in parallel on Azure Batch

I am using the latest version of R on Windows 7.

I would like to run many tests in parallel using RSelenium, so my question is:

  • What is the recommended way to run many RSelenium tests?

Let's say I would like to run 1000 tests and each test takes 1 hour. Running the tests one by one takes a lot of time (24 tests per day, so about 42 days in total). I know how to use the doParallel and foreach packages to run tests in parallel on my own machine (see: Run RSelenium in parallel), but sometimes this is not enough. I would like to run around 100 tests in parallel. I tried to use Azure Batch for that, but I get lots of errors on some nodes when starting the Selenium server.
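The time estimate can be sketched as a short calculation. The figures are the ones given in the question; the parallel case assumes equal run times and no start-up overhead, which is an idealization.

```r
# Figures taken from the question: 1000 tests, about 1 hour each.
n_tests <- 1000
hours_per_test <- 1

# Sequential run: 24 one-hour tests fit into one day.
sequential_days <- n_tests * hours_per_test / 24

# With 100 tests running at once (assumption: equal run times, no overhead),
# the work is done in ceiling(1000 / 100) = 10 batches of one hour each.
n_parallel <- 100
parallel_hours <- ceiling(n_tests / n_parallel) * hours_per_test

cat(sprintf("sequential: %.1f days, parallel: %.0f hours\n",
            sequential_days, parallel_hours))
# prints: sequential: 41.7 days, parallel: 10 hours
```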

More concretely, I have written a Dockerfile:

FROM rocker/r-base:latest 

RUN  apt-get update \
  && apt-get install -y --no-install-recommends \
   libxml2-dev \
   libcurl4-openssl-dev \
   libssl-dev \
   gnupg2 \
   libfftw3-dev \
   libtiff-dev \
   libx11-dev \
   libcairo2-dev \
   libxt-dev \
   firefox

#RUN add-apt-repository -y ppa:mozillateam/firefox-next

## Install Java 
RUN echo "deb http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main" \ 
        | tee /etc/apt/sources.list.d/webupd8team-java.list \ 
    && echo "deb-src http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main" \ 
        | tee -a /etc/apt/sources.list.d/webupd8team-java.list \ 
    && apt-key adv --keyserver keyserver.ubuntu.com --recv-keys EEA14886 \ 
    && echo "oracle-java8-installer shared/accepted-oracle-license-v1-1 select true" \ 
        | /usr/bin/debconf-set-selections \ 
    && apt-get update \ 
    && apt-get install -y oracle-java8-installer \ 
    && update-alternatives --display java \ 
    && rm -rf /var/lib/apt/lists/* \ 
    && apt-get clean \ 
    && R CMD javareconf 

## make sure Java can be found in rApache and other daemons not looking in R ldpaths 
RUN echo "/usr/lib/jvm/java-8-oracle/jre/lib/amd64/server/" > /etc/ld.so.conf.d/rJava.conf 
RUN /sbin/ldconfig

# Install the R Packages from CRAN
RUN Rscript -e 'install.packages(c("Cairo", "Rcpp", "RSelenium", "httr", "rvest", "imager", "RCurl"))'

I used the doAzureParallel package to execute many scripts in parallel:

# prepare Azure batch
setwd("E:/data/R/web_scraping/zk_ba/azure")
library(doAzureParallel) 
setVerbose(TRUE)
setAutoDeleteJob(FALSE)
generateCredentialsConfig("credentials.json") 
setCredentials("credentials.json")
generateClusterConfig("cluster.json")
cluster <- makeCluster("cluster.json") 
registerDoAzureParallel(cluster) 
getDoParWorkers()
opt <- list(wait = FALSE) 

jobId <- foreach(
  i = 1:n_cluster, 
  # .packages = c("RSelenium", "imager", "httr", "RCurl", "rvest"),
  # .combine = 'rbind',
  .errorhandling = "pass",
  .options.azure = opt, 
  .export = c("metadata", "first_step", "parcele_df", "vlasnici_df", "status_teret_df", "n_cluster")
) %dopar% { 

  library(RSelenium)
  library(imager)
  library(httr)
  library(RCurl)
  library(rvest)

  #-----------------------------------#
  #    START SELENIUM AND PREPARE     #
  #-----------------------------------#

  if (first_step == TRUE) {
    tryCatch({
      rD <<- RSelenium::rsDriver(
        browser = "firefox",
        extraCapabilities = list(
          "moz:firefoxOptions" = list(
            args = list('--headless')
          )
        )
      )
    }, error = function(e) NA)
    driver <<- rD$client
    driver$open()
    driver$navigate("http://www.e-grunt.ba/")
    Sys.sleep(3L)
..
}

but this returns an error on many nodes:

<simpleError in checkError(res): Undefined error in httr call. httr output: Failed to connect to localhost port 4567: Connection refused>
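Note that the tryCatch in the code above replaces any start-up error with NA, so the original message is lost before driver is assigned. A minimal sketch of keeping that message per worker follows; try_start and failing_start are hypothetical names, not part of RSelenium, and failing_start only stands in for a failing rsDriver() call so that the sketch runs without Selenium.

```r
# Hypothetical helper (not part of RSelenium): run a start-up function and
# keep the error message instead of discarding it.
try_start <- function(start_fn) {
  tryCatch(
    list(ok = TRUE, value = start_fn(), message = NA_character_),
    error = function(e) {
      list(ok = FALSE, value = NULL, message = conditionMessage(e))
    }
  )
}

# Stand-in for a failing rsDriver() call, so the sketch runs without Selenium.
failing_start <- function() stop("Failed to connect to localhost port 4567")

res <- try_start(failing_start)
if (!res$ok) message("Selenium start failed: ", res$message)
```

Returning such a list from each foreach iteration would make it possible to see which workers failed and why, rather than only that they failed.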

What would be the general advice for situations where we need to run lots of RSelenium tests in parallel?

Grouping answered 26/11, 2018 at 11:4 Comment(2)
But I think I have to start the driver on each VM, not on every node, and I am using 4 VMs and 4 nodes. I don't know why the same port would be a problem if the VMs are independent of one another. I have also tried to run a Selenium session in parallel on a local port, calling the rsDriver function only once. All other nodes successfully connected to this driver on the one port. - Grouping
Are you trying to run your case on Azure DevOps pipelines? - Aplite

By default, rsDriver starts its Selenium server on port 4567 and connects to it there. Once a server started by one parallel worker is bound to this port, another worker on the same machine cannot start its own server on it.

A solution is to add a port argument to the rsDriver call inside the foreach loop, so that each iteration uses its own port:

rD <<- RSelenium::rsDriver(
  port = 4567L + as.integer(i),  # a distinct port for each foreach iteration
  browser = "firefox",
  extraCapabilities = list(
    "moz:firefoxOptions" = list(
      args = list('--headless')
    )
  )
)

You may also have to check that these ports do not clash with other applications.
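One possible way to check for such clashes is sketched below using only base R. The helpers port_in_use and worker_port, and the stride of 100, are hypothetical choices for illustration and not part of RSelenium. The check only detects a listener that already exists, so two workers probing the same port at the same moment could still collide.

```r
# Hypothetical helpers, not part of RSelenium.

# TRUE if something already accepts TCP connections on this local port.
port_in_use <- function(port, host = "localhost") {
  con <- suppressWarnings(tryCatch(
    socketConnection(host = host, port = port, open = "r+",
                     blocking = TRUE, timeout = 1),
    error = function(e) NULL
  ))
  if (is.null(con)) return(FALSE)
  close(con)
  TRUE
}

# Start from base + i (as in the answer) and move up in steps of `stride`
# until a port with no listener is found.
worker_port <- function(i, base = 4567L, stride = 100L, max_tries = 10L) {
  port <- base + as.integer(i)
  for (k in seq_len(max_tries)) {
    if (!port_in_use(port)) return(port)
    port <- port + stride
  }
  stop("no free port found for worker ", i)
}
```

Inside the foreach loop, the result could then be passed as port = worker_port(i) in place of 4567L + as.integer(i).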

Og answered 15/5, 2020 at 19:49 Comment(0)
