Navigating to "url", waiting until "load" - Python Playwright Issue

Hey, I have this Python Playwright code for getting a page's source:

import json
import sys
import urllib.parse

from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

# proxy server and target URL arrive URL-encoded on the command line
server_proxy = urllib.parse.unquote(sys.argv[1])
link = urllib.parse.unquote(sys.argv[2])

with sync_playwright() as p:
    # browser = p.chromium.launch(headless=False)
    browser = p.chromium.launch(proxy={"server": server_proxy, "username": "xxx", "password": "xxx"})
    context = browser.new_context(user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36")
    page = context.new_page()
    with open("cookies_tessco.json") as cookie_file:
        cookies = json.load(cookie_file)
    context.add_cookies(cookies)
    page.goto(link)
    try:
        page.wait_for_timeout(10000)
        cont = page.content()
        print(cont)
    except Exception as e:
        # print str(e), not the Page object, so the message actually prints
        print("Error in playwright script: " + str(e))
    finally:
        page.close()
        context.close()
        browser.close()

This works okay, but sometimes I receive this error:

Traceback (most recent call last):
  File "page_tessco.py", line 17, in <module>
    page.goto(link)
  File "/usr/local/lib/python3.9/site-packages/playwright/sync_api/_generated.py", line 5774, in goto
    self._sync(
  File "/usr/local/lib/python3.9/site-packages/playwright/_impl/_sync_base.py", line 103, in _sync
    return task.result()
  File "/usr/local/lib/python3.9/site-packages/playwright/_impl/_page.py", line 464, in goto
    return await self._main_frame.goto(**locals_to_params(locals()))
  File "/usr/local/lib/python3.9/site-packages/playwright/_impl/_frame.py", line 117, in goto
    await self._channel.send("goto", locals_to_params(locals()))
  File "/usr/local/lib/python3.9/site-packages/playwright/_impl/_connection.py", line 36, in send
    return await self.inner_send(method, params, False)
  File "/usr/local/lib/python3.9/site-packages/playwright/_impl/_connection.py", line 47, in inner_send
    result = await callback.future
playwright._impl._api_types.TimeoutError: Timeout 30000ms exceeded.
=========================== logs ===========================
navigating to "https://www.tessco.com/product/207882", waiting until "load"

I tried to add

page.wait_for_timeout(10000)

but these errors still appear sometimes. I'm also confused about why the error only shows up intermittently: what causes it? If anyone has experience with this, please share.

Rhinology answered 6/7, 2021 at 7:40 Comment(0)

The URL https://www.tessco.com/product/207882 loads quite slowly.

Try extending the default timeout of 30000ms by passing a timeout to page.goto(link):

page.goto(link, timeout = 0)

Setting the timeout to 0 disables the timeout entirely. See the documentation.

Alternatively, you can disable timeout with the following:

page.set_default_timeout(0)
page.goto(link)
Thievery answered 6/7, 2021 at 9:40 Comment(6)
I did that; however, I still sometimes receive the same error: playwright._impl._api_types.TimeoutError: Timeout 30000ms exceeded.Rhinology
@HHHHHHT I tried your code but could not reproduce the error. I removed the proxy parameter from the p.chromium.launch() statement. Maybe try that.Thievery
Yeah, it appears sometimes, not always. And I can't remove the proxy: the website blocks my server IP, which is why I need to use proxies.Rhinology
Maybe try a high number like 100000 instead of 0.Thievery
The 30000ms exceeded is the default value for the overall test timeout, not the navigation timeout. I'm not sure where this is configured in Python, though.Ieyasu
page.goto(link, timeout = 0) can hang forever, stifling errors. It's OK to set it to a few minutes or even an hour or so, but blocking forever is overkill and never really necessary. If there's something abnormal, I'd want a report of that situation so it can be dealt with.Commutable

Another alternative (for cases where you only sometimes experience timeouts) is to keep retrying the page load in a while loop that breaks out only when the try block succeeds. The key here (and something I learned) is that the continue statement in the except block doesn't propagate the exception; it simply retries the code inside the while loop.

from time import sleep

while True:
    try:
        page.goto(link)
    except Exception:
        # optional: give the network time to recover before retrying
        sleep(<SLEEP FOR SOME AMOUNT OF SECONDS>)
        continue
    break

The sleep is optional here, but it does give your network time to recover if it's a networking issue. Also, if you want a maximum number of retries (instead of retrying forever), you can always do:

retries = 1
max_retries = 10
while retries <= max_retries:
    try:
        page.goto(link)
    except Exception:
        sleep(<SLEEP FOR SOME AMOUNT OF SECONDS>)
        retries += 1
        continue
    break

Immaterialize answered 21/7, 2023 at 17:15 Comment(0)

I suggest using page.goto(url, wait_until="domcontentloaded"), which is faster than the default wait_until="load" option.

MDN explains the difference between the two events:

The load event is fired when the whole page has loaded, including all dependent resources such as stylesheets and images. This is in contrast to DOMContentLoaded, which is fired as soon as the page DOM has been loaded, without waiting for resources to finish loading.

See my blog post on these events for more detail. It's targeted to Puppeteer but applies equally to Playwright in this case.

Using domcontentloaded doesn't guarantee you won't see timeouts, blocks, or other weird behaviors, since every site is unique, but it should improve your results in general.

Playwright has also added a wait_until="commit" option which seems even faster than domcontentloaded and runs before the document starts loading. I recommend using this, with the caveat that I haven't used it much myself yet (but I'll try to update this post when I do).
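
For example, reusing page and link from the question's script (a sketch, not a drop-in fix):

# fires once the DOM is parsed; images and stylesheets may still be loading
page.goto(link, wait_until="domcontentloaded")

# or, even earlier, as soon as the navigation is committed:
# page.goto(link, wait_until="commit")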

Going a step further and blocking resources you don't need, like those stylesheets and images that MDN mentioned, is always good practice, and can help avoid loads getting stuck waiting for a slow resource and timing out. You can also disable JS, or even look into using a simple HTTP request and a static HTML parser like BeautifulSoup, if the data you want to scrape is in the static HTML.
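
Here's a minimal sketch of the resource-blocking idea; which resource types to abort is a judgment call for the site you're scraping:

# abort requests for heavy resources we don't need for HTML scraping
def block_heavy_resources(route):
    if route.request.resource_type in ("image", "stylesheet", "font", "media"):
        route.abort()
    else:
        route.continue_()

page.route("**/*", block_heavy_resources)
page.goto(link, wait_until="domcontentloaded")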

When you do wait for domcontentloaded, you'll generally want to follow it with a locator to target the specific element(s) you want to manipulate once the page loads. Avoid sleeping or waiting for a network state; these are too imprecise, which leads to flakiness and slowness. Sometimes wait_for_function() or wait_for_response() is appropriate, but the common-case solution is a locator.
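
For example (the selector here is hypothetical; substitute one matching the element you actually need):

page.goto(link, wait_until="domcontentloaded")
# wait for the specific element you care about rather than sleeping
page.locator("h1.product-name").wait_for()
print(page.locator("h1.product-name").inner_text())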

Regardless of what you do, never use timeout=0 because this turns a slow load with a visible error into a silent infinite loop. Your script will eventually need to be killed by hand to avoid becoming a zombie, and you won't have the benefit of a stack trace. Even if the script is mission-critical, you should throw an error after a few minutes, catch and log it and retry the nav so you have some visibility into the problem and can take steps to solve it.
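
A sketch of that pattern, with an assumed two-minute cap and a fixed retry count:

import sys
from playwright.sync_api import TimeoutError as PlaywrightTimeoutError

for attempt in range(3):
    try:
        page.goto(link, timeout=120_000, wait_until="domcontentloaded")
        break
    except PlaywrightTimeoutError as e:
        # log the failure so slow loads stay visible, then retry
        print(f"goto attempt {attempt + 1} failed: {e}", file=sys.stderr)
else:
    raise RuntimeError(f"Navigation to {link} failed after 3 attempts")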

Commutable answered 19/3, 2024 at 2:31 Comment(0)

To change the timeout, I suggest using page.set_default_navigation_timeout(), because it takes priority over page.set_default_timeout(), browser_context.set_default_timeout(), and browser_context.set_default_navigation_timeout().

Disabling the timeout by setting it to 0 is not recommended, because the script could then wait forever. Instead, use a large timeout value like 600_000 ms (10 minutes):

page.set_default_navigation_timeout(600_000)

And as @ggorlen suggested, you should add a wait_until option to your page.goto(). As a best practice, add wait logic before and after performing actions, as sketched below.
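
For example (a minimal sketch; the selector is hypothetical and the timeouts are judgment calls):

page.set_default_navigation_timeout(600_000)
page.goto(link, wait_until="domcontentloaded")
# hypothetical selector: wait for the element you need before acting on it
page.locator("#product-details").wait_for(timeout=60_000)
cont = page.content()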

This may not be related to your case, but does the site use some form of rate limiting? If so, you should slow down your scraping.
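
If rate limiting is in play, even a simple randomized delay between navigations can help (a sketch; tune the range to the site's limits):

import random
import time

# pause a few seconds between page loads to stay under a possible rate limit
time.sleep(random.uniform(2.0, 5.0))
page.goto(link, wait_until="domcontentloaded")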

Witenagemot answered 26/3, 2024 at 3:41 Comment(0)

In my case the problem was the parallelism of the different tests. I wanted to test my application with Firefox, Chromium, and WebKit, which caused:

Error: page.goto: NS_ERROR_CONNECTION_REFUSED

The solution was to run the tests with the --workers flag,

e.g.:

nx e2e *YourProjectNameHere* --workers 4

I guess you can play around with the number of workers a bit. Hope this helps.

Floorboard answered 15/5, 2024 at 17:37 Comment(0)
