How to delay an action until the webpage loads

I am using Selenium within R.

I have the following script which searches Google Maps for all pizza restaurants around a given geographical coordinate - and then keeps scrolling until all restaurants are loaded.

First, I navigate to the starting page:

library(RSelenium)
library(wdman)
library(netstat)

selenium()
selenium_object <- selenium(retcommand = TRUE, check = FALSE)

remote_driver <- rsDriver(browser = "chrome", chromever = "114.0.5735.90", verbose = FALSE, port = free_port())

remDr <- remote_driver$client

lat <- 40.7484
lon <- -73.9857

# Create the URL using the paste0 function
URL <- paste0("https://www.google.com/maps/search/pizza/@", lat, ",", lon, ",17z/data=!3m1!4b1!4m6!2m5!3m4!2s", lat, ",", lon, "!4m2!1d", lon, "!2d", lat, "?entry=ttu")

# Navigate to the URL
remDr$navigate(URL)

Then, I use the following code to keep scrolling until all entries have been loaded:

# Get the result elements that are currently loaded (this call does not wait by itself)
elements <- remDr$findElements(using = "css selector", "div.qjESne")

while (TRUE) {
    new_elements <- remDr$findElements(using = "css selector", "div.qjESne")

    # Pick the last element in the list - this is the one we want to scroll to
    last_element <- elements[[length(elements)]]
    # Scroll to the last element
    remDr$executeScript("arguments[0].scrollIntoView(true);", list(last_element))
    Sys.sleep(10)

    # Update the elements list
    elements <- new_elements

    # Stop once the end-of-list marker appears - the "You've reached the end of the list." message
    if (length(remDr$findElements(using = "css selector", "span.HlvSq")) > 0) {
        print("No more elements")
        break
    }
}

Finally, I use this code to extract the names and addresses of all restaurants:

titles <- c()
addresses <- c()

# Only parse once the end-of-list marker is present - the "You've reached the end of the list." message
if (length(remDr$findElements(using = "css selector", "span.HlvSq")) > 0) {
    # now we can parse the data since all the elements loaded
    for (data in remDr$findElements(using = "css selector", "div.lI9IFe")) {
        title <- data$findElement(using = "css selector", "div.qBF1Pd.fontHeadlineSmall")$getElementText()[[1]]
        restaurant <- data$findElement(using = "css selector", ".W4Efsd > span:nth-of-type(2)")$getElementText()[[1]]

        titles <- c(titles, title)
        addresses <- c(addresses, restaurant)
    }

    # This converts the list of titles and addresses into a dataframe
    df <- data.frame(title = titles, address = addresses)
    print(df)
}

Instead of using Sys.sleep() in R, I am trying to change my code so that it only scrolls (i.e., delays the action) once the previous action has completed. I am noticing that my existing code often freezes halfway through, and I suspect this is because I am trying to load new content when the existing page is not fully loaded. I think it might be better to somehow delay the action and wait for the page to be fully loaded before proceeding.

How might I be able to delay my script and force it to wait for the existing page to load before loading a new page? (e.g., R - Waiting for page to load in RSelenium with PhantomJS)

Note: I am also open to a Python solution.


Peregrinate answered 12/8, 2023 at 21:14 Comment(8)
Can you not use the wait function from selenium?Premedical
@Hermann12: thank you for your reply! Do you think you can please show me how to use this function here?Peregrinate
You can still find a lot of examples herePremedical
Thanks! I saw this link before - I am trying to learn how to modify these examples for the R programming languagePeregrinate
Explicit waits seem like they will help - I don't know about R, but there are examples for Python: explicit-waitsTello
You can use the waitUntil function to wait for a specific condition to be met before proceeding.Hepato
The bounty attracted at least one ChatGPT plagiariser.Voltaism
You seem to be scraping data from google. Did you try looking at Google places API instead?Recess
H
2
library(RSelenium)
library(wdman)
library(netstat)  # provides free_port()

# Initialize Selenium driver
driver <- rsDriver(browser = "chrome", verbose = FALSE, port = free_port())
remDr <- driver$client

lat <- 40.7484
lon <- -73.9857

# Create the URL
URL <- paste0("https://www.google.com/maps/search/pizza/@", lat, ",", lon, ",17z/data=!3m1!4b1!4m6!2m5!3m4!2s", lat, ",", lon, "!4m2!1d", lon, "!2d", lat, "?entry=ttu")

# Navigate to the URL
remDr$navigate(URL)

# Wait until the page is fully loaded
remDr$wait(timeout = 10, condition = function(d) {
  d$executeScript("return document.readyState === 'complete';")
})

# Your scrolling and data extraction code here

# Close the driver
remDr$close()

The remDr$wait function waits until the document.readyState becomes 'complete', indicating that the page has finished loading. Once the condition is met, you can proceed with your scrolling and data extraction code.
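
In case `remDr$wait` is not available in your installed RSelenium (as far as I can tell it is not among the documented `remoteDriver` methods), the same idea can be sketched with a small polling loop that only uses `executeScript`. Note that `wait_until` and `page_is_loaded` are hypothetical helper names made up for this example, not RSelenium functions, and the defaults (10 second timeout, 0.5 second polling interval) are assumptions you may want to tune.

```r
# Poll `condition` until it returns TRUE or `timeout` seconds have passed.
# Returns TRUE if the condition was met, FALSE if the timeout was reached.
# `wait_until` is a hypothetical helper, not part of RSelenium.
wait_until <- function(condition, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    # An error inside the condition is treated as "not ready yet"
    ok <- tryCatch(isTRUE(condition()), error = function(e) FALSE)
    if (ok) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# TRUE once the browser reports the document as fully loaded.
# executeScript() returns a list, so the value is read from its first element.
page_is_loaded <- function(remDr) {
  state <- remDr$executeScript("return document.readyState;")
  identical(state[[1]], "complete")
}

# Usage (assumes `remDr` is an open RSelenium client and `URL` is defined):
# remDr$navigate(URL)
# if (!wait_until(function() page_is_loaded(remDr), timeout = 10)) {
#   stop("Page did not finish loading within 10 seconds")
# }
```

Returning `FALSE` on timeout instead of raising an error leaves the decision to the caller: you can `stop()`, retry, or carry on with whatever has loaded.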

Using remDr$wait with the condition to wait for the page to be fully loaded is a more reliable approach than using Sys.sleep because it ensures that your script waits until the page is actually ready for interaction.
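
One caveat for the scrolling part: Google Maps loads further results via JavaScript after the document is already 'complete', so `document.readyState` alone may not tell you when the next batch has arrived. A rough sketch of replacing the fixed `Sys.sleep(10)` in the question's loop is to wait until the number of result elements grows (or the end-of-list message shows up). `scroll_to_end` is a hypothetical helper written for this example; the CSS selectors are the ones from the question, and the timeout, interval and `max_rounds` values are assumptions.

```r
# Scroll the result list until the end-of-list marker appears.
# Returns TRUE if the marker was found, FALSE on timeout or if nothing is listed.
# `scroll_to_end` is a hypothetical helper, not part of RSelenium.
scroll_to_end <- function(remDr,
                          item_css = "div.qjESne",   # result separator, from the question
                          end_css = "span.HlvSq",    # end-of-list message, from the question
                          timeout = 10, interval = 0.5, max_rounds = 200) {
  count <- function(css) {
    length(remDr$findElements(using = "css selector", value = css))
  }
  for (round in seq_len(max_rounds)) {
    if (count(end_css) > 0) return(TRUE)

    items <- remDr$findElements(using = "css selector", value = item_css)
    n_before <- length(items)
    if (n_before == 0) return(FALSE)

    # Scroll to the last loaded element to trigger loading of the next batch
    remDr$executeScript("arguments[0].scrollIntoView(true);", list(items[[n_before]]))

    # Wait only as long as needed: until more items or the end marker appear
    deadline <- Sys.time() + timeout
    while (count(item_css) <= n_before && count(end_css) == 0) {
      if (Sys.time() >= deadline) return(FALSE)
      Sys.sleep(interval)
    }
  }
  FALSE
}

# Usage (assumes `remDr` is an open RSelenium client on the results page):
# if (scroll_to_end(remDr)) print("No more elements")
```

With this shape the script moves on as soon as new results are present and gives up after `timeout` seconds without progress, rather than hanging indefinitely.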

Hepato answered 15/8, 2023 at 18:58 Comment(5)
@Raj Hassan: thank you for your answer! Can you please explain what is the 10 doing in your code?Peregrinate
Thanks for noticing, I missed explaining the timeout attribute here. The remDr$wait function takes two main arguments: the timeout duration and a condition function. The timeout duration specifies how many seconds RSelenium will wait for the condition to be met before timing out and continuing the script.Hepato
@Raj Hassan: thank you for your reply! I am still trying to understand this: if you wait 10 seconds for a timeout but the page loads in 4 seconds - this means you saved 6 seconds. But if the page takes more than 10 seconds to load, you skip to the next action... correct?Peregrinate
Let me explain for you. If the condition is met within the timeout duration (e.g., the page loads in 4 seconds but you set a timeout of 10 seconds): RSelenium will not wait for the full 10 seconds; it will proceed as soon as the condition is met. If the condition is not met within the timeout duration (e.g., the page takes longer than 10 seconds to load): RSelenium will wait for the full timeout duration of 10 seconds. After the timeout, if the condition is still not met, RSelenium will raise an error or proceed with the next action.Hepato
@Raj Hassan: thank you so much for your explanation! If you have time, could you please take my full code and show how to insert your logic into it? Thank you so much!Peregrinate
