Wait until page is loaded with Selenium WebDriver for Python
I want to scrape all the data of a page implemented with infinite scroll. The following Python code works:

for i in range(100):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)

This means that every time I scroll down to the bottom, I need to wait 5 seconds, which is generally enough for the page to finish loading the newly generated content. But this may not be time-efficient: the page may finish loading the new content in well under 5 seconds. How can I detect whether the page has finished loading the new content each time I scroll down? If I can detect this, I can scroll down again as soon as the loading is done, which is more time-efficient.
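One way to make the loop adaptive is to compare the page height before and after each scroll and stop once it no longer grows. This is a minimal sketch, assuming the page grows document.body.scrollHeight as new items load; scroll_until_no_new_content is an invented helper name:

```python
import time

def scroll_until_no_new_content(driver, pause=1.0, max_rounds=100):
    """Scroll to the bottom repeatedly, stopping as soon as the page
    height stops growing (i.e. no new content was appended)."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give newly requested content a moment to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # height unchanged: no new content was loaded
        last_height = new_height
```

The pause can be much shorter than 5 seconds, since the loop only continues while the page is actually growing.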

Dias answered 25/10, 2014 at 20:14 Comment(4)
It might help to know a little more about the page. Are the elements sequential or predictable? You could wait for elements to load by checking visibility using id or xpathUnideaed
I am crawling the following page: pinterest.com/cremedelacrumb/yumDias
possible duplicate of Reliably detect page load or time out, Selenium 2Packsaddle
Does this answer your question? Wait for page load in SeleniumSchertz
The webdriver will wait for a page to load by default via the .get() method.

As you may be looking for some specific element as @user227215 said, you should use WebDriverWait to wait for an element located in your page:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

browser = webdriver.Firefox()
browser.get("url")
delay = 3 # seconds
try:
    myElem = WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.ID, 'IdOfMyElement')))
    print("Page is ready!")
except TimeoutException:
    print("Loading took too much time!")

I have used it for checking alerts. You can use any other locator strategy to find the element.

EDIT 1:

I should mention that the webdriver will wait for a page to load by default. It does not wait for loading inside frames or for AJAX requests. This means that when you use .get('url'), your browser will wait until the page is completely loaded and then move on to the next command in the code. But when you are posting an AJAX request, the webdriver does not wait, and it's your responsibility to wait an appropriate amount of time for the page, or a part of the page, to load; that is why there is a module named expected_conditions.
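Under the hood, expected_conditions combined with WebDriverWait is just polling: call a condition repeatedly until it returns something truthy or time runs out. A bare-bones, stdlib-only sketch of that idea (wait_until is an invented name, not Selenium's API):

```python
import time

def wait_until(condition, timeout=10, poll=0.5):
    """Poll `condition` until it returns a truthy value or `timeout`
    seconds elapse; a stripped-down version of WebDriverWait.until."""
    end = time.time() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.time() > end:
            raise TimeoutError("condition not met within {} s".format(timeout))
        time.sleep(poll)
```

With Selenium, the condition would be something like a lambda that tries to find the element the AJAX request is expected to insert.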

Lennie answered 25/10, 2014 at 21:44 Comment(15)
What is "IdOfMyElement"? Is it something I should predict like the index of something will be loaded newly? For example, I want to crawl the following page: pinterest.com/cremedelacrumb/yumDias
You should find an element in your page which you're sure always exists in the page. "IdOfMyElement" refers to an element's ID in the page; if it doesn't have an ID, you can use any other type of locator, like an xpath.Lennie
I think it should not be something always existing. It should be something that will be newly loaded once scrolling down. Am I right? For example, can you tell me what is this element of the page I said before?Dias
The link <a href="/" id="logo" class="logo" data-force-refresh="1" data-element-type="146">Pinterest</a> is such an element in the link you have provided. BTW, check out my edit.Lennie
I was getting "find_element() argument after * must be a sequence, not WebElement" changed to "WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.ID, "IdOfMyElement"))) " see manual selenium-python.readthedocs.org/en/latest/waits.htmlHaberman
The comment by @Haberman and the answer by David Cullen were what worked for me. Perhaps this accepted answer could be updated accordingly?Recurved
Passing browser.find_element_by_id('IdOfMyElement') causes a NoSuchElementException to be raised. The documentation says to pass a tuple that looks like this: (By.ID, 'IdOfMyElement'). See my answerBhatt
Hopefully this helps someone else out because it wasn't clear to me initially: WebDriverWait will actually return a web object that you can then perform an action on (e.g. click()), read text out of etc. I was under the mistaken impression that it just caused a wait, after which you still had to find the element. If you do a wait, then a find element afterward, selenium will error out because it tries to find the element while the old wait is still processing (hopefully that makes sense). Bottom line is, you don't need to find the element after using WebDriverWait -- it is already an object.Bloodmobile
Does the webdriver wait for the images to be loaded before continuing with the rest of the script?Glover
@PetarVasilev, if you're referring to get method, you can read this answer.Lennie
@Gopgop Wow this is so ugly is not a constructive comment. What is ugly about it? How could it be made better?Glynas
@ModusTollens The selenium authors need to export simple functions similar to "find_element_by_id" for example, "wait_for_element_by_id". The current way of doing this is off the API structure.Irresolute
A more serious problem is when the browser returns stale entries after clicking. Waiting for some time is the only solution. Even then it is not guaranteed to work if the page takes more time to load.Millepore
I use EC.visibility_of_element_located instead of EC.presence_of_element_located because visibility will wait until it is shown from being hidden and presence won't.Phonon
For the first time, yes, get() might wait for the entire page to load, but then how about when we click on a button and that takes us to a different page and now we have to wait for the page to entirely load?Pantoja
Trying to pass find_element_by_id to the constructor for presence_of_element_located (as shown in the accepted answer) caused NoSuchElementException to be raised. I had to use the syntax in fragles' comment:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('url')
timeout = 5
try:
    element_present = EC.presence_of_element_located((By.ID, 'element_id'))
    WebDriverWait(driver, timeout).until(element_present)
except TimeoutException:
    print("Timed out waiting for page to load")

This matches the example in the documentation. Here is a link to the documentation for By.

Bhatt answered 18/5, 2016 at 14:49 Comment(4)
Thank you! yes, this was needed for me too. ID isn't the only attribute that can be used, to get full list, use help(By). E.g. I used EC.presence_of_element_located((By.XPATH, "//*[@title='Check All Q1']"))Recurved
That's the way it works for me as well! I wrote an additional answer expanding on the different locators that are available with the By object.Infestation
I've posted a followup question dealing with expectations where different pages may be loaded, and not always the same page: #51642046Adriell
In some cases this method does not work. For example, if you scrape page one and then get page two of a same website, all Ids in two pages are the same and .until(element_present) always will be True.Stratigraphy
Below are three methods:

readyState

Checking page readyState (not reliable):

def page_has_loaded(self):
    self.log.info("Checking if {} page is loaded.".format(self.driver.current_url))
    page_state = self.driver.execute_script('return document.readyState;')
    return page_state == 'complete'

The wait_for helper function is good, but unfortunately click_through_to_new_page is open to the race condition where we manage to execute the script in the old page, before the browser has started processing the click, and page_has_loaded just returns true straight away.

id

Comparing new page ids with the old one:

def page_has_loaded_id(self, old_page):
    self.log.info("Checking if {} page is loaded.".format(self.driver.current_url))
    try:
        new_page = self.driver.find_element_by_tag_name('html')
        return new_page.id != old_page.id
    except NoSuchElementException:
        return False

It's possible that comparing ids is not as effective as waiting for stale reference exceptions.

staleness_of

Using staleness_of method:

@contextlib.contextmanager
def wait_for_page_load(self, timeout=10):
    self.log.debug("Waiting for page to load at {}.".format(self.driver.current_url))
    old_page = self.driver.find_element_by_tag_name('html')
    yield
    WebDriverWait(self.driver, timeout).until(staleness_of(old_page))

For more details, check Harry's blog.
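The before/after pattern behind staleness_of can be expressed generically: capture a token from the old page, perform the action, then block until the token's value changes. A stdlib-only sketch of that idea (wait_for_change and read_token are invented names, not Selenium code):

```python
import contextlib
import time

@contextlib.contextmanager
def wait_for_change(read_token, timeout=10, poll=0.1):
    """Capture a token before the wrapped action runs, then block until
    the token's value changes -- the idea behind staleness_of."""
    old = read_token()
    yield
    end = time.time() + timeout
    while read_token() == old:
        if time.time() > end:
            raise TimeoutError("page did not change within {} s".format(timeout))
        time.sleep(poll)
```

With Selenium, read_token could, for instance, return the internal id of the current html element, as in the id-comparison method above.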

Packsaddle answered 21/5, 2015 at 23:9 Comment(4)
Why do you say that self.driver.execute_script('return document.readyState;') not reliable? It seems to work perfectly for my use case, which is waiting for a static file to load in a new tab (which is opened via javascript in another tab instead of .get()).Counterreply
@ArthurHebert Could be not reliable due to race condition, I've added relevant cite.Packsaddle
Thank you so much for this response. I had searched so many places to find something that would work while scraping many different websites that all have different elements that load. None of the other results worked. To get mine working all I did was do a "ready = 'nope', while ready != 'complete': ready = dv.execute_script(xxxx). Works great for what I need it to.Bertrambertrand
Obey The Testing Goat: - I'm trying to scrape a WordPress wpDataTable front-end to a database on a site that used to just present their data as HTML pages. Now the site presents the data where to get to the second page you need to JavaScript the page. A variation on the "Current Working Solution" got me over my immediate goal line (5 successful scrapes out of 5 tries).Chucho
As mentioned in the answer from David Cullen, I've always seen recommendations to use a line like the following one:

element_present = EC.presence_of_element_located((By.ID, 'element_id'))
WebDriverWait(driver, timeout).until(element_present)

It was difficult for me to find in one place all the possible locators that can be used with By, so I thought it would be useful to provide the list here. According to Web Scraping with Python by Ryan Mitchell:

ID

Used in the example; finds elements by their HTML id attribute

CLASS_NAME

Used to find elements by their HTML class attribute. Why is this function CLASS_NAME not simply CLASS? Using the form object.CLASS would create problems for Selenium's Java library, where .class is a reserved method. In order to keep the Selenium syntax consistent between different languages, CLASS_NAME was used instead.

CSS_SELECTOR

Finds elements by their class, id, or tag name, using the #idName, .className, tagName convention.

LINK_TEXT

Finds HTML tags by the text they contain. For example, a link that says "Next" can be selected using (By.LINK_TEXT, "Next").

PARTIAL_LINK_TEXT

Similar to LINK_TEXT, but matches on a partial string.

NAME

Finds HTML tags by their name attribute. This is handy for HTML forms.

TAG_NAME

Finds HTML tags by their tag name.

XPATH

Uses an XPath expression ... to select matching elements.
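For reference, each strategy pairs with a value as a (strategy, value) tuple, which is exactly what presence_of_element_located expects. The element names below are invented for illustration; the string literals are the values the By constants resolve to:

```python
# One hypothetical (strategy, value) pair per locator strategy above.
locators = [
    ("id", "element_id"),
    ("class name", "results"),
    ("css selector", "#content a.more"),
    ("link text", "Next"),
    ("partial link text", "Nex"),
    ("name", "email"),
    ("tag name", "article"),
    ("xpath", "//*[@title='Check All Q1']"),
]
```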

Infestation answered 14/10, 2016 at 7:19 Comment(3)
The documentation for By lists the attributes which can be used as locators.Bhatt
That was what I had been looking for! Thanks! Well, now it should be easier to find, as Google was sending me to this question but not to the official documentation.Infestation
Thanks for the citation from the book. It is much clearer than the documentation.Eelpout
From selenium/webdriver/support/wait.py

driver = ...
from selenium.webdriver.support.wait import WebDriverWait
element = WebDriverWait(driver, 10).until(
    lambda x: x.find_element_by_id("someId"))
Bracken answered 26/1, 2017 at 12:17 Comment(0)
Have you tried driver.implicitly_wait? It is like a setting for the driver, so you only call it once per session, and it basically tells the driver to wait the given amount of time until each command can be executed.

driver = webdriver.Chrome()
driver.implicitly_wait(10)

So if you set a wait time of 10 seconds, it will execute the command as soon as possible, waiting up to 10 seconds before it gives up. I've used this in similar scroll-down scenarios, so I don't see why it wouldn't work in your case. Hope this is helpful.

Be sure to use a lower-case 'w' in implicitly_wait.

Misogamy answered 13/5, 2018 at 4:36 Comment(2)
What is the difference between implicitly wait and webdriverwait?Abandoned
@song0089 Check this, this and this discussions.Weigela
Here is how I did it, using a rather simple form:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

browser = webdriver.Firefox()
browser.get("url")
searchTxt = None
while not searchTxt:
    try:
        searchTxt = browser.find_element_by_name('NAME OF ELEMENT')
        searchTxt.send_keys("USERNAME")
    except NoSuchElementException:
        continue
Aciculate answered 27/10, 2018 at 15:44 Comment(0)
A solution for AJAX pages that continuously load data. The previous methods do not work here. What we can do instead is grab the page DOM, hash it, and compare the old and new hash values over a delta of time.

import time
from selenium import webdriver

def page_has_loaded(driver, sleep_time = 2):
    '''
    Waits for page to completely load by comparing current page hash values.
    '''

    def get_page_hash(driver):
        '''
        Returns html dom hash
        '''
        # can find element by either 'html' tag or by the html 'root' id
        dom = driver.find_element_by_tag_name('html').get_attribute('innerHTML')
        # dom = driver.find_element_by_id('root').get_attribute('innerHTML')
        dom_hash = hash(dom.encode('utf-8'))
        return dom_hash

    page_hash = 'empty'
    page_hash_new = ''
    
    # comparing old and new page DOM hash together to verify the page is fully loaded
    while page_hash != page_hash_new: 
        page_hash = get_page_hash(driver)
        time.sleep(sleep_time)
        page_hash_new = get_page_hash(driver)
        print('<page_has_loaded> - page not loaded')

    print('<page_has_loaded> - page loaded: {}'.format(driver.current_url))
Radian answered 22/7, 2020 at 20:43 Comment(4)
Granted, I haven't experimented with this extensively yet, but this seems to be EXACTLY what I was looking for! Thanks!Leptosome
The best answer, actually. Thank you very much @SoRobby!Utas
As I know, find_element_by_tag_name is not working anymore. driver.find_element(By.TAG_NAME, 'html') works.Utas
Hashing whole dom in a while loop? How to say it more precisely... it is not very efficient. :D One of those solutions, where people fetch million of rows from database in a loop, one by one. :DGregarine
How about putting WebDriverWait in a while loop and catching the exceptions?

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

browser = webdriver.Firefox()
browser.get("url")
delay = 3 # seconds
while True:
    try:
        WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.ID, 'IdOfMyElement')))
        print("Page is ready!")
        break # it will break from the loop once the specific element is present.
    except TimeoutException:
        print("Loading took too much time! - Trying again")
Alejandroalejo answered 8/5, 2017 at 6:44 Comment(2)
you don't need the loop?Acred
why do you need while True here? As I know, the until method will wait until the delay timeoutAstonied
You can do that very simply with this function:

def page_is_loading(driver):
    # document.readyState is "complete" once the page has finished loading
    return driver.execute_script("return document.readyState") != "complete"

and when you want to do something after the page has finished loading, you can use:

import time

Driver = webdriver.Firefox(options=Options, executable_path='geckodriver.exe')
Driver.get("https://www.google.com/")

while page_is_loading(Driver):
    time.sleep(0.1)  # a short sleep keeps the loop from spinning the CPU

Driver.execute_script("alert('page is loaded')")
Osteology answered 10/7, 2020 at 8:23 Comment(3)
that's a purely CPU-blocking script.Terrific
Downvoted, it is a really inefficient busy waiting, no one should do thatSnavely
Upvoted for correctness. Optimality is a separate issue, but this works in general.Keratogenous
Selenium can't detect whether the page is fully loaded or not, but JavaScript can. I suggest you try this.

from selenium.webdriver.support.ui import WebDriverWait
WebDriverWait(driver, 100).until(lambda driver: driver.execute_script('return document.readyState') == 'complete')

This executes JavaScript code instead of Python, because JavaScript can detect when a page is fully loaded: document.readyState will report 'complete'. This code means: for up to 100 seconds, keep checking document.readyState until it is 'complete'.

Undulant answered 19/7, 2022 at 10:28 Comment(0)
Use this in your code:

from selenium import webdriver

driver = webdriver.Firefox() # or Chrome()
driver.implicitly_wait(10) # seconds
driver.get("http://www.......")

Or you can use this code if you are looking for a specific tag:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox() #or Chrome()
driver.get("http://www.......")
try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "tag_id"))
    )
finally:
    driver.quit()
Favored answered 16/8, 2020 at 13:43 Comment(0)
Very good answers here. Here's a quick example of waiting for an XPath.

# wait for sizes to load - 2s timeout
try:
    WebDriverWait(driver, 2).until(expected_conditions.presence_of_element_located(
        (By.XPATH, "//div[@id='stockSizes']//a")))
except TimeoutException:
    pass
Plasterwork answered 18/1, 2021 at 12:23 Comment(0)
I struggled a bit to get this working, as it didn't work for me as expected. Anyone who is still struggling to get this working may want to check this.

I want to wait for an element to be present on the webpage before proceeding with my manipulations.

We can use WebDriverWait(driver, 10, 1).until(), but the catch is that until() expects a function which it can execute every 1 second for the duration of the timeout provided (in our case, 10 seconds). So keeping it like below worked for me.

element_found = wait_for_element.until(lambda x: x.find_element_by_class_name("MY_ELEMENT_CLASS_NAME").is_displayed())

Here is what until() does behind the scenes:

def until(self, method, message=''):
        """Calls the method provided with the driver as an argument until the \
        return value is not False."""
        screen = None
        stacktrace = None

        end_time = time.time() + self._timeout
        while True:
            try:
                value = method(self._driver)
                if value:
                    return value
            except self._ignored_exceptions as exc:
                screen = getattr(exc, 'screen', None)
                stacktrace = getattr(exc, 'stacktrace', None)
            time.sleep(self._poll)
            if time.time() > end_time:
                break
        raise TimeoutException(message, screen, stacktrace)
Raster answered 6/9, 2021 at 7:5 Comment(0)
If you are trying to scroll and find all items on a page, you can consider using the following. It is a combination of a few methods mentioned by others here, and it did the job for me:

while True:
    try:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        driver.implicitly_wait(30)
        time.sleep(4)
        elem1 = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "element-name")))
        len_elem_1 = len(elem1)
        print(f"A list Length {len_elem_1}")
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        driver.implicitly_wait(30)
        time.sleep(4)
        elem2 = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "element-name")))
        len_elem_2 = len(elem2)
        print(f"B list Length {len_elem_2}")
        if len_elem_1 == len_elem_2:
            print(f"final length = {len_elem_1}")
            break
    except TimeoutException:
            print("Loading took too much time!")
Hesta answered 30/11, 2021 at 20:18 Comment(0)
from selenium.webdriver.common.by import By

nono = driver.current_url
driver.find_element(By.XPATH, "//button[@value='Send']").click()
while driver.current_url == nono:
    pass
print("page loaded.")
Deport answered 6/11, 2022 at 6:8 Comment(1)
Your answer could be improved by adding more information on what the code does and how it helps the OP.Pietje
