R - Waiting for page to load in RSelenium with PhantomJS
I put together a crude scraper that scrapes prices/airlines from Expedia:

# Load RSelenium and start the server
library(RSelenium)
rD <- rsDriver(browser = "phantomjs", verbose = FALSE)

# Assign the client
remDr <- rD$client

# Have the driver poll for up to 1000 ms when locating elements
remDr$setImplicitWaitTimeout(1000)

# Navigate to Expedia.com
appURL <- "https://www.expedia.com/Flights-Search?flight-type=on&starDate=04/30/2017&mode=search&trip=oneway&leg1=from:Denver,+Colorado,to:Oslo,+Norway,departure:04/30/2017TANYT&passengers=children:0,adults:1"
remDr$navigate(appURL)

# Give a crawl delay to see if it gives time to load web page
Sys.sleep(10)   # Been testing with 10

###ADD JAVASCRIPT INJECTION HERE###
remDr$executeScript(?)

# Extract Prices
webElem <- remDr$findElements(using = "css", "[class='dollars price-emphasis']")
prices <- unlist(lapply(webElem, function(x){x$getElementText()}))
print(prices)

# Extract Airlines
webElem <- remDr$findElements(using = "css", "[data-test-id='airline-name']")
airlines <- unlist(lapply(webElem, function(x){x$getElementText()}))
print(airlines)

# close client/server
remDr$close()
rD$server$stop()

As you can see, I built in an implicit wait timeout and a Sys.sleep() call so that the page has time to load in PhantomJS and so I don't overload the website with requests.

Generally speaking, when looping over a date range, the scraper works well. However, when looping through 10+ dates consecutively, Selenium sometimes throws a StaleElementReference error and stops execution. I know the reason: the page has not finished loading, so elements with class='dollars price-emphasis' don't exist yet. The URL construction is fine.

Whenever the page loads all the way, the scraper finds roughly 60 prices and airlines. I mention this because the script sometimes returns only 15-20 entries (checking the same date manually in a browser shows 60). In those cases I conclude that I'm only finding 20 of 60 elements, meaning the page has only partially loaded.

I want to make this script more robust by injecting JavaScript that waits for the page to load fully before searching for elements. I know the way to do this is remDr$executeScript(), and I have found many useful JS snippets for accomplishing this, but with my limited JS knowledge I'm having trouble adapting these solutions to work syntactically with my script.
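
For reference, remDr$executeScript() takes the JavaScript source as a string plus an args list, and the values in args are exposed inside the script as arguments[0], arguments[1], and so on. A minimal sketch that counts the price elements currently in the DOM:

# Count price elements rendered so far; the CSS selector is passed
# via args and read back inside the script as arguments[0]
n <- unlist(remDr$executeScript(
  "return document.querySelectorAll(arguments[0]).length;",
  args = list(".dollars.price-emphasis")
))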

Here are several solutions that have been proposed in Wait for page load in Selenium and Selenium - How to wait until page is completely loaded:

Base code (note: these snippets are Java, taken from the linked Selenium answers; they are not JavaScript and cannot be passed to executeScript() as-is):

WebDriverWait wait = new WebDriverWait(driver, 20);
By addItem = By.cssSelector(".dollars.price-emphasis");

Additions to base script:

1) Check for Staleness of an Element

// get the "Add Item" element
WebElement element = wait.until(ExpectedConditions.presenceOfElementLocated(addItem));
// wait for the "Add Item" element to become stale
wait.until(ExpectedConditions.stalenessOf(element));

2) Wait for Visibility of element

wait.until(ExpectedConditions.visibilityOfElementLocated(addItem));

I have tried using remDr$executeScript("return document.readyState") as a check before proceeding with the scrape (in R the comparison would be == "complete" on the unlisted result; Java's .equals() doesn't exist here), but the page always reports complete even when it's not, presumably because readyState only reflects the initial document load, not the results injected afterwards by JavaScript.
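
For what it's worth, a runnable R version of that check is a polling loop like the sketch below (note the unlist(): executeScript() returns an R list). It still only confirms the initial document load, not the AJAX-rendered results:

# Poll document.readyState until the initial document has loaded.
# This does NOT cover results injected afterwards by AJAX.
while (unlist(remDr$executeScript("return document.readyState;", args = list())) != "complete") {
  Sys.sleep(0.25)
}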

Does anyone have any suggestions about how I could adapt one of these solutions to work with my R script? Any ideas on how I could wait until the page has fully loaded, with all ~60 elements present? I'm still learning, so any help would be greatly appreciated.

Sharonsharona answered 13/4, 2017 at 21:46 Comment(3)
I usually just use while/if/control flow to see if some part of the page exists yet, though it's a good idea to put in a timeout in case something else fails. There may be a more elegant solution, though. - Velocipede
@Velocipede This was great advice and ultimately solved the problem. It was such a simple solution! I'll post my while/if answer. - Sharonsharona
@Sharonsharona, can you post your solution with the while/if loop please? I think it might help me. - Kattegat

Solution using while/tryCatch:

remDr$navigate("<webpage url>")

webElem <- NULL
while (is.null(webElem)) {
  # loop until the element with name <value> is found in <webpage url>
  webElem <- tryCatch(remDr$findElement(using = 'name', value = "<value>"),
                      error = function(e) NULL)
}
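
As suggested in the comments on the question, it's good practice to add a timeout so the loop can't spin forever if the element never appears. A minimal sketch, assuming an arbitrary 30-second budget:

webElem <- NULL
deadline <- Sys.time() + 30   # assumed budget: give up after 30 seconds
while (is.null(webElem) && Sys.time() < deadline) {
  webElem <- tryCatch(remDr$findElement(using = 'name', value = "<value>"),
                      error = function(e) NULL)
  Sys.sleep(0.25)   # brief pause between polls
}
if (is.null(webElem)) stop("timed out waiting for element")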
Amoretto answered 28/6, 2018 at 17:42 Comment(1)
Great answer! Is it possible to apply your answer to this question? #76888200 - Win

To tack on a bit more convenience to the while/tryCatch answer above: a common element present on virtually every page is body, which can be accessed via CSS. I also made it a function and added a quick random sleep (always good practice). This should work on most web pages with text, without you needing to pick a specific element:

## use the double arrow to assign remDr to the global environment
## so the function below can retrieve it:
# remDr <<- remDr

wetest <- function(sleepmin, sleepmax) {
  remDr <- get("remDr", envir = globalenv())
  webElemtest <- NULL
  while (is.null(webElemtest)) {
    # loop until the <body> element is found, i.e. the page has started rendering
    webElemtest <- tryCatch(remDr$findElement(using = 'css', "body"),
                            error = function(e) NULL)
  }
  # random crawl delay between sleepmin and sleepmax seconds
  randsleep <- sample(seq(sleepmin, sleepmax, by = 0.001), 1)
  Sys.sleep(randsleep)
}

Usage:

remDr$navigate("https://bbc.com/news")
clickable <- remDr$findElements(using = 'xpath', '//button[contains(@href, "")]')
clickable[[1]]$clickElement()
wetest(sleepmin = 0.5, sleepmax = 1)
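
A variation on the same pattern that addresses the original question more directly is to poll until the expected number of results is present before scraping. The helper below is only a sketch: wait_for_count, the 50-element threshold, and the poll interval are all assumptions to adapt:

# Hypothetical helper: poll until at least min_n elements match the
# CSS selector, or until the timeout (in seconds) expires
wait_for_count <- function(remDr, css, min_n, timeout = 30) {
  deadline <- Sys.time() + timeout
  repeat {
    found <- tryCatch(remDr$findElements(using = "css", css),
                      error = function(e) list())
    if (length(found) >= min_n || Sys.time() > deadline) return(found)
    Sys.sleep(0.5)
  }
}

# e.g. wait until most of the ~60 price elements have rendered
webElem <- wait_for_count(remDr, "[class='dollars price-emphasis']", min_n = 50)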
Maternity answered 29/1, 2021 at 23:15 Comment(0)
