Get href link using python playwright

I am trying to extract the link inside an href attribute, but all I am getting is the text inside the element.

The website code is the following:

<div class="item-info-container ">
   <a href="/imovel/32600863/" role="heading" aria-level="2" class="item-link xh-highlight" 
   title="Apartamento T3 na avenida da Liberdade, São José de São Lázaro e São João do Souto, Braga">
   Apartamento T3 na avenida da Liberdade, São José de São Lázaro e São João do Souto, Braga
   </a>

And the code I am using is:

element_handle = page.locator('//div[@class="item-info-container "]//a').all_inner_texts()

No matter if I specify //a[@href] or not, my output is always the title text:

Apartamento T3 na avenida da Liberdade, São José de São Lázaro e São João do Souto, Braga

When what I really want to achieve is:

/imovel/32600863/

Any ideas of where my logic is failing me?

Syverson answered 6/7, 2023 at 1:10 Comment(4)
The ELEMENT you want is the <a> element. Once you have that element, you need to use get_attribute to fetch its href attribute. Playwright was not designed for web scraping. Why are you using it? There are several packages that were designed specifically for scraping. - Intolerance
Thanks for letting me know, I use playwright because it is the only one I was able to bypass DataDome with. - Syverson
See this: stackoverflow.com/a/70750943 - Penn
@TimRoberts Playwright has useful features such as locator.wait_for(state='visible') and locator.scroll_into_view_if_needed(). What should be used instead of Playwright for scraping dynamic content? - Disapprobation

Using get_attribute:

link = page.locator('.item-info-container ').get_by_role('link').get_attribute('href')

More than one locator:

link_locators = page.locator('.item-info-container ').get_by_role('link').all()
for link in link_locators:
    print(link.get_attribute('href'))
Uprush answered 6/7, 2023 at 6:59 Comment(4)
Code returns an error playwright._impl._api_types.Error: Error: strict mode violation: locator(".item-info-container").get_by_role("link") resolved to 30 elements: - Syverson
Added code if more than one. - Uprush
Just saw that you sorted it out :) - Uprush
For some reason, repeatedly poking the browser via get_attribute('href') can be orders of magnitude slower than getting page.content() from the browser and parsing it with BeautifulSoup, for example [div.find('a')['href'] for div in bs4.BeautifulSoup(page.content(), 'html.parser').find_all('div', class_='item-info-container')] - Disapprobation
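
The parse-the-HTML idea from the last comment can be tried without a browser at all. The following is only a rough sketch: it swaps BeautifulSoup for the standard library's html.parser so that it has no dependencies, and the names ItemLinkParser and extract_hrefs are made up for illustration. Unlike the comment's one-liner, it collects every link inside a container, not just the first one. In a real script the HTML string would come from page.content().

```python
from html.parser import HTMLParser


class ItemLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside div.item-info-container."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # > 0 while we are inside a matching container div
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            # Count nested divs too, so the matching </div> is found correctly.
            if self.depth or "item-info-container" in classes:
                self.depth += 1
        elif tag == "a" and self.depth and attrs.get("href"):
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


def extract_hrefs(html):
    """Return the hrefs found inside item-info-container divs, in document order."""
    parser = ItemLinkParser()
    parser.feed(html)
    return parser.hrefs


# Stand-in for page.content(); markup shortened from the question.
sample = (
    '<div class="item-info-container ">'
    '<a href="/imovel/32600863/" role="heading">Apartamento T3</a>'
    '</div>'
    '<div class="other"><a href="/skip/">ignored</a></div>'
)
print(extract_hrefs(sample))
```

Whether this is actually faster than repeated get_attribute calls depends on the page and was not measured here.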

I managed to do it by locating all matching elements and then reading the href attribute from each one.

handle_links = page.locator('//div[@class="item-info-container "]/a')
for link in handle_links.element_handles():
    href = link.get_attribute('href')
    print(href)

and the outcome would be:

/imovel/32611494/
/imovel/32642523/
/imovel/32633771/
/imovel/32527162/
/imovel/30344934/
/imovel/31221488/
/imovel/32477875/
/imovel/31221480/
/imovel/32450120/
/imovel/32515628/
/imovel/32299064/
Syverson answered 6/7, 2023 at 9:15 Comment(0)

Just replace the second // with / and use the following XPath 1.0 expression:

//div[@class="item-info-container "]/a/@href

This will give you the @href attribute's value: "/imovel/32600863/".
Probably the whole command will be

element_handle = page.locator('//div[@class="item-info-container "]/a/@href').all_inner_texts()

but the result of the expression is an attribute rather than an element, and Playwright locators are meant to resolve to elements, so this may fail.
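
One way to sidestep that problem, as a minimal sketch: keep the XPath pointing at the <a> elements and read the attribute in a second step with get_attribute. The names ITEM_LINKS_XPATH, hrefs_by_xpath and demo are hypothetical, and the demo assumes Playwright and a Chromium build are installed.

```python
# XPath from the question, ending at the <a> element instead of /@href.
ITEM_LINKS_XPATH = '//div[@class="item-info-container "]/a'


def hrefs_by_xpath(page, xpath=ITEM_LINKS_XPATH):
    """Return the href attribute of every element the XPath matches."""
    # locator() resolves to elements; the attribute is read afterwards.
    return [link.get_attribute("href") for link in page.locator(xpath).all()]


def demo():
    # Imported here so the helper above can be reused without a browser.
    from playwright.sync_api import sync_playwright

    html = (
        '<div class="item-info-container ">'
        '<a href="/imovel/32600863/">Apartamento T3</a>'
        '</div>'
    )
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.set_content(html)
        print(hrefs_by_xpath(page))
        browser.close()
```

Calling demo() should print the list of href values for the sample markup.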

Turnedon answered 6/7, 2023 at 1:40 Comment(0)

This answer is optimal for getting the href from a single element, but an alternative approach for grabbing multiple href attributes is to use evaluate_all rather than use element handles. .all() is discouraged, since each subsequent .get_attribute() call will be a new inter-process network request on a handle that might be stale. In contrast, evaluate_all only does one network request to the browser, using synchronous JavaScript to extract all data in one shot.

Here's an example:

from playwright.sync_api import sync_playwright # 1.40.0


html = """<div class="item-info-container ">
<a href="/imovel/32600863/" role="heading" aria-level="2" class="item-link xh-highlight" title="Apartamento T3 ...">
  Apartamento T3 ...
</a></div>"""

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.set_content(html)
    links = (
        page.locator(".item-info-container a")
            .evaluate_all("els => els.map(el => el.href)")
    )
    print(links)
    browser.close()

I generally suggest avoiding XPath when CSS selectors suffice; the syntax is much cleaner.
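
A small variation on the example above, offered as a sketch: in the browser, the el.href property is resolved against the page URL, while el.getAttribute('href') returns the value exactly as written in the markup, which is the form the question asks for (/imovel/32600863/). The helper name raw_hrefs is made up for illustration; page is assumed to be a Playwright Page like the one created above.

```python
def raw_hrefs(page, selector=".item-info-container a"):
    """Return the unresolved href attribute of every element matching selector."""
    # A single evaluate_all call: the mapping runs inside the browser.
    return page.locator(selector).evaluate_all(
        "els => els.map(el => el.getAttribute('href'))"
    )
```

With the page from the example above, print(raw_hrefs(page)) can replace the links = ... and print(links) lines.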

Slipslop answered 10/3 at 19:55 Comment(0)
