Get href link using python playwright

Asked 6/7, 2023 at 1:10 Answered 10/3 at 19:55

Solved python web-scraping xpath playwright playwright-python

I am trying to extract the link from an href attribute, but all I am getting is the text inside the element.

The website code is the following:

<div class="item-info-container ">
   <a href="/imovel/32600863/" role="heading" aria-level="2" class="item-link xh-highlight" 
   title="Apartamento T3 na avenida da Liberdade, São José de São Lázaro e São João do Souto, Braga">
   Apartamento T3 na avenida da Liberdade, São José de São Lázaro e São João do Souto, Braga
   </a>

And the code I am using is:

element_handle = page.locator('//div[@class="item-info-container "]//a').all_inner_texts()

No matter if I specify //a[@href] or not, my output is always the title text:

Apartamento T3 na avenida da Liberdade, São José de São Lázaro e São João do Souto, Braga

When what I really want to achieve is:

/imovel/32600863/

Any ideas of where my logic is failing me?

Syverson answered 6/7, 2023 at 1:10 Comment(4)

The ELEMENT you want is the <a> element. Once you have that element, you need to use get_attribute to fetch its href attribute. Playwright was not designed for web scraping. Why are you using it? There are several packages that were designed specifically for scraping. – Intolerance 6/7, 2023 at 1:28

Thanks for letting me know, I use playwright because it is the only one I was able to bypass DataDome with – Syverson 6/7, 2023 at 1:30

See this - stackoverflow.com/a/70750943 – Penn 11/3 at 10:41

@TimRoberts Playwright has useful features such as locator.wait_for(state='visible') and locator.scroll_into_view_if_needed(). What should be used instead of Playwright for scraping dynamic content? – Disapprobation 9/8 at 23:13

Using get_attribute:

link = page.locator('.item-info-container ').get_by_role('link').get_attribute('href')

If the locator matches more than one element:

link_locators = page.locator('.item-info-container ').get_by_role('link').all()
for link in link_locators:
    print(link.get_attribute('href'))

Uprush answered 6/7, 2023 at 6:59 Comment(4)

Code returns an error playwright._impl._api_types.Error: Error: strict mode violation: locator(".item-info-container").get_by_role("link") resolved to 30 elements: – Syverson 6/7, 2023 at 8:43

Added code if more than one – Uprush 6/7, 2023 at 10:11

Just saw that you sorted it out :) – Uprush 6/7, 2023 at 10:12

For some reason, repeatedly poking the browser via get_attribute('href') can be orders of magnitude slower than getting page.content() from the browser and parsing it with BeautifulSoup, for example

[div.find('a')['href'] for div in bs4.BeautifulSoup(page.content(), 'html.parser').find_all('div', class_='item-info-container')]

– Disapprobation 9/8 at 23:16
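For anyone who wants to try the "fetch the HTML once, parse it locally" idea from the comment above without installing BeautifulSoup, here is a minimal sketch using only the standard library's html.parser. This is a swapped-in technique, not what the commenter used. The HrefCollector class, the extract_hrefs helper, and the sample markup are made up for illustration; in a real script the string would come from page.content(). Note it takes the first <a> that has an href in each container, which is slightly stricter than div.find('a')['href'].

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href of the first <a href=...> inside each div.item-info-container."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self._div_depth = 0   # 0 = outside a container; >0 = <div> nesting depth inside one
        self._found = False   # True once the current container has yielded a link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            if self._div_depth:
                self._div_depth += 1
            elif "item-info-container" in classes:
                self._div_depth = 1
                self._found = False
        elif tag == "a" and self._div_depth and not self._found:
            href = attrs.get("href")
            if href is not None:
                self.hrefs.append(href)
                self._found = True

    def handle_endtag(self, tag):
        if tag == "div" and self._div_depth:
            self._div_depth -= 1


def extract_hrefs(html):
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Stand-in for page.content(); the markup is a simplified, made-up sample.
sample_html = """
<div class="item-info-container ">
  <a href="/imovel/32600863/" class="item-link">Apartamento T3 ...</a>
</div>
<div class="item-info-container ">
  <a href="/imovel/32611494/" class="item-link">Apartamento T2 ...</a>
</div>
"""

print(extract_hrefs(sample_html))
```

Whether this is actually faster than per-element get_attribute calls on a given page is something to measure rather than assume.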

I managed to do it by locating all the matching elements and then reading the href attribute from each one.

handleLinks = page.locator('//div[@class="item-info-container "]/a')
# Iterate over every matched <a> element and print its href
for links in handleLinks.element_handles():
    linkF = links.get_attribute('href')
    print(linkF)

The output is:

/imovel/32611494/
/imovel/32642523/
/imovel/32633771/
/imovel/32527162/
/imovel/30344934/
/imovel/31221488/
/imovel/32477875/
/imovel/31221480/
/imovel/32450120/
/imovel/32515628/
/imovel/32299064/

Syverson answered 6/7, 2023 at 9:15 Comment(0)

Replace the // before a with / and append /@href, which gives the following XPath 1.0 expression:

//div[@class="item-info-container "]/a/@href

This will give you the @href attribute's value: "/imovel/32600863/".
The whole command would probably be

element_handle = page.locator('//div[@class="item-info-container "]/a/@href').all_inner_texts()

but the expression selects an attribute node rather than an element, and Playwright locators resolve to elements, so this may fail or return nothing.

Turnedon answered 6/7, 2023 at 1:40 Comment(0)

This answer is optimal for getting the href from a single element, but an alternative approach for grabbing multiple href attributes is to use evaluate_all rather than element handles. .all() is discouraged here, since each subsequent .get_attribute() call is a separate inter-process network request on a handle that might be stale. In contrast, evaluate_all makes only one network request to the browser, using synchronous JavaScript to extract all the data in one shot.

Here's an example:

from playwright.sync_api import sync_playwright # 1.40.0


html = """<div class="item-info-container ">
<a href="/imovel/32600863/" role="heading" aria-level="2" class="item-link xh-highlight" title="Apartamento T3 ...">
  Apartamento T3 ...
</a></div>"""

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.set_content(html)
    links = (
        page.locator(".item-info-container a")
            .evaluate_all("els => els.map(el => el.href)")
    )
    print(links)
    browser.close()

I generally suggest avoiding XPath when CSS selectors suffice; the syntax is much cleaner.
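One detail worth knowing about the example above: the DOM href property read by el.href gives the URL resolved against the page's address, whereas el.getAttribute('href') gives the attribute exactly as written in the markup (the /imovel/32600863/ form the question asks for). As a rough sketch of what that resolution amounts to, here is the same step done in Python with urljoin. The https://example.com/... base is a placeholder, since the real site's URL is not given in the question.

```python
from urllib.parse import urljoin

# Placeholder page address (an assumption for illustration only).
page_url = "https://example.com/search/page-2"

# Raw attribute values, as written in the HTML.
raw_hrefs = ["/imovel/32600863/", "/imovel/32611494/"]

# Resolve each root-relative path against the page address.
absolute = [urljoin(page_url, href) for href in raw_hrefs]
print(absolute)
```

If the raw paths are what you need, swapping el.href for el.getAttribute('href') inside the evaluate_all callback should return them unchanged.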

Slipslop answered 10/3 at 19:55 Comment(0)
