HTML/XML: Understanding How "Scroll Bars" Work

I am working with the R programming language and trying to learn how to use Selenium to interact with webpages.

For example, using Google Maps - I am trying to find the name, address and longitude/latitude of all pizza shops around a certain area. As I understand it, this would involve entering the location you are interested in, clicking the "nearby" button, entering what you are looking for (e.g. "pizza"), scrolling all the way to the bottom to make sure all pizza shops are loaded - and then copying the names, addresses and longitudes/latitudes of all pizza locations.

I have been teaching myself how to use Selenium in R and have been able to solve parts of this problem. Here is what I have done so far:

Part 1: Searching for an address (e.g. Statue of Liberty, New York, USA) and returning its longitude/latitude:

library(RSelenium)
library(wdman)
library(netstat)

# Start the Selenium server (downloads the required binaries on first run)
selenium()
selenium_object <- selenium(retcommand = T, check = F)

# chromever must match the version of Chrome installed on the machine
remote_driver <- rsDriver(browser = "chrome", chromever = "114.0.5735.90", verbose = F, port = free_port())

remDr<- remote_driver$client
remDr$navigate("https://www.google.com/maps")

search_box <- remDr$findElement(using = 'css selector', "#searchboxinput")
search_box$sendKeysToElement(list("Statue of Liberty", key = "enter"))

Sys.sleep(5)

url <- remDr$getCurrentUrl()[[1]]

# After a search, the Maps URL contains "@latitude,longitude,zoom",
# so capture the two numbers that follow the "@"
long_lat <- gsub(".*@(-?[0-9.]+),(-?[0-9.]+),.*", "\\1,\\2", url)
long_lat <- unlist(strsplit(long_lat, ","))

> long_lat
[1] "40.7269409"  "-74.0906116"

Part 2: Searching for all Pizza shops around a certain location:

library(RSelenium)
library(wdman)
library(netstat)

selenium()
selenium_object <- selenium(retcommand = T, check = F)

remote_driver <- rsDriver(browser = "chrome", chromever = "114.0.5735.90", verbose = F, port = free_port())

remDr<- remote_driver$client


remDr$navigate("https://www.google.com/maps")


Sys.sleep(5)

search_box <- remDr$findElement(using = 'css selector', "#searchboxinput")
search_box$sendKeysToElement(list("40.7256456,-74.0909442", key = "enter"))

Sys.sleep(5)


search_box <- remDr$findElement(using = 'css selector', "#searchboxinput")
search_box$clearElement()
search_box$sendKeysToElement(list("pizza", key = "enter"))


Sys.sleep(5)

But from here, I do not know how to proceed. I do not know how to scroll the results all the way to the bottom so that everything is loaded, and I do not know how to start extracting the names.

Doing some research (i.e. inspecting the HTML code), I made the following observations:

  • The name of a restaurant location can be found in the following tags: <a class="hfpxzc" aria-label=

  • The address of a restaurant location can be found in the following tags: <div class="W4Efsd">

In the end, I would be looking for a result like this:

        name                            address longitude latitude
1 pizza land 123 fake st, city, state, zip code    45.212  -75.123

Can someone please show me how to proceed?

Note: Seeing as more people likely use Selenium through Python - I am more than happy to learn how to solve this problem in Python and then try to convert the answer into R code.

Thanks!

UPDATE: Some further progress with addresses

remDr$navigate("https://www.google.com/maps")

Sys.sleep(5)

search_box <- remDr$findElement(using = 'css selector', "#searchboxinput")
search_box$sendKeysToElement(list("40.7256456,-74.0909442", key = "enter"))

Sys.sleep(5)

search_box <- remDr$findElement(using = 'css selector', "#searchboxinput")
search_box$clearElement()
search_box$sendKeysToElement(list("pizza", key = "enter"))

Sys.sleep(5)

address_elements <- remDr$findElements(using = 'css selector', '.W4Efsd')
addresses <- lapply(address_elements, function(x) x$getElementText()[[1]])

# The names live in the aria-label attribute of the <a class="hfpxzc"> elements
name_elements <- remDr$findElements(using = 'css selector', 'a.hfpxzc')
names <- lapply(name_elements, function(x) x$getElementAttribute("aria-label")[[1]])

result <- data.frame(name = unlist(names), address = unlist(addresses))
Garboard answered 17/7, 2023 at 3:39 Comment(1)
I haven't used Selenium before, so I dunno! I could start looking into it, but it would take a whileCzerny

I see that you updated your question to say you are happy with a Python answer, so here's how it's done in Python. You can apply the same approach in R.

The page is lazy-loaded, which means that more results are fetched and appended as you scroll.

So what you need to do is keep scrolling to the last loaded element, which triggers the page to load more content.

Finding how more data is loaded

You need to find out how the data is loaded. Here's what I did:

First, disable internet access for your browser in the dev tools Network tab (F12 -> Network -> Offline).


Then scroll to the last loaded element; you will see a loading indicator (since there is no internet, it will just hang).


Now comes the important part: find out which HTML element this loading indicator sits under:


As you can see, that element matches the div.qjESne CSS selector.

Working with Selenium

You can call the JavaScript scrollIntoView() function, which scrolls a particular element into view within the browser's viewport.

Finding out when to break

To find out when to stop scrolling, we need to find out which element appears when there is no more data.

If you scroll until there are no more results, you will see:


which is an element matching the CSS selector span.HlvSq.

Code examples

Scrolling the page
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


URL = "https://www.google.com/maps/search/Restaurants/@40.7256843,-74.1138399,14z/data=!4m8!2m7!3m5!1sRestaurants!2s40.7256456,-74.0909442!4m2!1d-74.0909442!2d40.7256456!6e5?entry=ttu"

driver = webdriver.Chrome()


driver.get(URL)

# Wait up to 10 seconds for the result elements to load before scrolling
wait = WebDriverWait(driver, 10)
elements = wait.until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.qjESne"))
)

while True:
    new_elements = wait.until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.qjESne"))
    )

    # Pick the last element in the list - this is the one we want to scroll to
    last_element = elements[-1]
    # Scroll to the last element
    driver.execute_script("arguments[0].scrollIntoView(true);", last_element)

    # Update the elements list
    elements = new_elements

    # Stop once the "You've reached the end of the list." message is present
    if driver.find_elements(By.CSS_SELECTOR, "span.HlvSq"):
        print("No more elements")
        break
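
If it helps with converting this to R, here is a rough, untested RSelenium sketch of the same scroll loop (the div.qjESne and span.HlvSq selectors are the ones identified above and may change whenever Google updates the page):

# Keep scrolling the last loaded result into view until the
# "You've reached the end of the list." marker appears
repeat {
  elements <- remDr$findElements(using = 'css selector', 'div.qjESne')
  last_element <- elements[[length(elements)]]

  # scrollIntoView() brings the last card into the viewport,
  # which triggers loading of the next batch of results
  remDr$executeScript("arguments[0].scrollIntoView(true);", args = list(last_element))
  Sys.sleep(1)

  if (length(remDr$findElements(using = 'css selector', 'span.HlvSq')) > 0) break
}
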
Getting the data

If you inspect the page, you will see that the data for each result card is under the CSS selector div.lI9IFe.

What you need to do is wait until the scrolling has finished, and then grab all the data under the CSS selector div.lI9IFe.

Code example

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

URL = "https://www.google.com/maps/search/Restaurants/@40.7256843,-74.1138399,14z/data=!4m8!2m7!3m5!1sRestaurants!2s40.7256456,-74.0909442!4m2!1d-74.0909442!2d40.7256456!6e5?entry=ttu"

driver = webdriver.Chrome()
driver.get(URL)

# Wait up to 10 seconds for the result elements to load before scrolling
wait = WebDriverWait(driver, 10)
elements = wait.until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.qjESne"))
)
titles = []
links = []
addresses = []

while True:
    new_elements = wait.until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.qjESne"))
    )

    # Pick the last element in the list - this is the one we want to scroll to
    last_element = elements[-1]
    # Scroll to the last element
    driver.execute_script("arguments[0].scrollIntoView(true);", last_element)

    # Update the elements list
    elements = new_elements
    # time.sleep(0.1)

    # Stop once the "You've reached the end of the list." message is present
    if driver.find_elements(By.CSS_SELECTOR, "span.HlvSq"):
        # All results are loaded now, so parse the data
        for data in driver.find_elements(By.CSS_SELECTOR, "div.lI9IFe"):
            title = data.find_element(
                By.CSS_SELECTOR, "div.qBF1Pd.fontHeadlineSmall"
            ).text
            address = data.find_element(
                By.CSS_SELECTOR, ".W4Efsd > span:nth-of-type(2)"
            ).text

            titles.append(title)
            addresses.append(address)

        # Combine the titles and addresses into a dataframe
        df = pd.DataFrame(list(zip(titles, addresses)), columns=["title", "addresses"])
        print(df)
        break

Prints:

                            title               addresses
0                   Domino's Pizza  · 741 Communipaw Ave A
1        Tommy's Family Restaurant       · 349 Central Ave
2     VIP RESTAURANT LLC BARSHAY'S           · 175 Sip Ave
3    The Hutton Restaurant and Bar         · 225 Hutton St
4                        Barge Inn            · 324 3rd St
..                             ...                     ...
116            Bettie's Restaurant     · 579 West Side Ave
117               Mahboob-E-El Ahi     · 580 Montgomery St
118                Samosa Paradise        · 804 Newark Ave
119                     TACO DRIVE        · 195 Newark Ave
120                Two Boots Pizza        · 133 Newark Ave

[121 rows x 2 columns]
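
If you prefer to stay in R for the extraction step as well, a rough, untested RSelenium sketch (reusing the div.lI9IFe, div.qBF1Pd.fontHeadlineSmall and .W4Efsd > span:nth-of-type(2) selectors from the Python code above) could look like this:

# Run this only after the scrolling loop has finished
cards <- remDr$findElements(using = 'css selector', 'div.lI9IFe')

titles <- sapply(cards, function(card) {
  card$findChildElement(using = 'css selector', 'div.qBF1Pd.fontHeadlineSmall')$getElementText()[[1]]
})
addresses <- sapply(cards, function(card) {
  card$findChildElement(using = 'css selector', '.W4Efsd > span:nth-of-type(2)')$getElementText()[[1]]
})

result <- data.frame(title = titles, address = addresses)
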
Clance answered 20/7, 2023 at 4:49 Comment(7)
@ MendelG: Thank you so much for your answer! I have started to convert parts of it to R and I think it might be working! How long do you think this code might take to fully scroll to the bottom of the page? A few minutes?Garboard
as i understand - the code you provided is only meant to scroll ... will I still have to use my existing code to get the names, addresses and long/latitudes of each restaurant? If you have time could you please extend your answer to include the extraction of the names, addresses, long/latitude of each restaurant in python? thank you so much! I really appreciate it!Garboard
Thank you so much for this update! I will work towards converting this to R code ... just to clarify: the modification you have written will first scroll to the bottom of the page until it is impossible to keep scrolling .... and then it will extract all names and addresses?Garboard
@Garboard correctClance
Thank you so much! is there an address column here?Garboard
@Garboard I've edited my answer: I found a different CSS selector where all the main data is, simply narrow down the CSS selectors to get the exact elements. in this example, we have the titles/addressesClance
I posted a similar question over here - can you please take a look if you have time? thank you so much #76891648Garboard

That is already a good start. I can name a few things I did to proceed, but note that I mainly work with Python.

For locating elements within the DOM tree I suggest using XPath. It has a human-readable syntax and is quite easy to learn.

https://devhints.io/xpath

There you can find an overview of all the ways to locate elements, plus a linked test bench by "Whitebeam.org" for practice. It also helps with understanding how to extract names. It will look something like this:

This returns an object for the given XPath expression:

restaurant_adr <- remDr$findElement(using = 'xpath', "//*/*[@class='W4Efsd']")

From this object we then need to read the text; in RSelenium the method is getElementText() (in Python it would be .text):

restaurant_adr$getElementText()[[1]]

For scrolling there is the Actions API wheel (https://www.selenium.dev/documentation/webdriver/actions_api/wheel/), but it has no documentation for R.

Or you could use JavaScript for scrolling:

driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

https://cran.r-project.org/web/packages/js/vignettes/intro.html
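
In RSelenium, the same JavaScript call can be issued with executeScript (a sketch; note that on the Maps results page it is the results panel, not the window, that actually scrolls):

# Scroll the window to the bottom via injected JavaScript
remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
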

Helpful resources:

https://statsandr.com/blog/web-scraping-in-r/

https://betterdatascience.com/r-web-scraping/

https://scrapfly.io/blog/web-scraping-with-r/#http-clients-crul

Arnett answered 18/7, 2023 at 7:51 Comment(2)
@ Knight: Thank you so much for your answer! I will try to read some of this stuff and get back to you!Garboard
If you have time, could you please write an answer in Python code? Even though I am working in R, I could then try to convert your answer into R. Thank you so much!Garboard
