Fetch all href links using Selenium in Python
I am practicing Selenium in Python and I wanted to fetch all the links on a web page using Selenium.

For example, I want all the links in the href= property of all the <a> tags on http://psychoticelites.com/

I've written a script and it runs, but it prints the element objects (their addresses) instead of the links. I tried using the id attribute to get the value, but that doesn't work.

My current script:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys


driver = webdriver.Firefox()
driver.get("http://psychoticelites.com/")

assert "Psychotic" in driver.title

continue_link = driver.find_element_by_tag_name('a')
elem = driver.find_elements_by_xpath("//*[@href]")
#x = str(continue_link)
#print(continue_link)
print(elem)
Nephology answered 13/1, 2016 at 6:26 Comment(2)
What do you want instead of the object address?Grindle
the actual 'VALUE' i.e., the link itself.Nephology
S
105

For Selenium 4.3.0 and later, you can try the following:

from selenium.webdriver.common.by import By

elems = driver.find_elements(By.XPATH, "//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))

You can also refer to answers here for a more detailed explanation.


Note: The solution below works only for Selenium versions before 4.3.0.

Well, you have to simply loop through the list:

elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))

find_elements_by_* returns a list of elements (note the spelling of 'elements'). Loop through the list, take each element and fetch the required attribute value you want from it (in this case href).

Stepfather answered 13/1, 2016 at 6:33 Comment(9)
why is it that all the documentation says xpath is "not recommended" but most of the answers on stackoverflow use xpath?Impossibility
XPath is NOT reliable. If the DOM of the website changes, so does the XPath, and your script is bound to crash. After working on multiple scraping scripts, I've concluded that XPath should be a last resort.Nephology
Short XPaths like the one in this example are reliable; I use lots of driver.find_element_by_xpath("//*[@id='<my identifier>']"). If an XPath becomes a long string that depends on columns/rows/divs and other layout details, it should not be used.Verdha
What if I need to return hrefs that belong to a specific class?Rsfsr
You can use this to get elements based on their Class Name driver.find_elements_by_class_name("content"), where "content" is the name of the class you're looking for.Nephology
.get_attribute is not available anymore, what's the new one?Abolish
@Abolish - It's still get_attribute in the docs as well.Stepfather
AttributeError: 'WebDriver' object has no attribute 'find_elements_by_xpath'Tod
This answer is not valid for newer versions. See #72755151Gamboa
C
10

I have checked and tested that there is a function named find_elements_by_tag_name() you can use. This example works fine for me.

elems = driver.find_elements_by_tag_name('a')
for elem in elems:
    href = elem.get_attribute('href')
    if href is not None:
        print(href)
Corrinnecorrival answered 29/4, 2020 at 23:43 Comment(3)
This creates a StaleElementReferenceException error for me on the line href=elem.get_attribute('href'). I tried printing the elem to the console before I access it to get the attribute but that just moves the exception to the line trying to print. this is the exact message: stale element reference: element is not attached to the page document Edit: forgot to press shift enter so I did not have the message. corrected in editTripedal
get_attribute is not working, what's the new method in Selenium Python?Abolish
@Abolish get_attribute still works. find_elements_by_*** does not. See my updated posted answer.Cockadoodledoo
C
5
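import time

driver.get(URL)
time.sleep(7)  # give the page time to load before looking for links
# Selenium < 4.3.0; newer versions use driver.find_elements(By.XPATH, "//a[@href]")
elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))
driver.close()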

Note: Adding a delay is very important. First run it in debug mode and make sure the page is actually getting loaded. If the page loads slowly, increase the delay (sleep time) before extracting.
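
A fixed sleep is either longer than needed or too short on a slow connection. One alternative is to poll until the links actually appear; Selenium ships WebDriverWait in selenium.webdriver.support.ui for this kind of waiting. The sketch below shows the same idea using only the standard library. The name wait_for_links and its parameters are hypothetical and not part of Selenium.

```python
import time


def wait_for_links(find_links, timeout=10.0, poll_interval=0.5):
    """Call find_links() repeatedly until it returns a non-empty list.

    find_links is any zero-argument callable, for example
    lambda: driver.find_elements(By.XPATH, "//a[@href]").
    Returns the first non-empty result, or [] once timeout seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        links = find_links()
        if links:
            return links
        if time.monotonic() >= deadline:
            return []
        time.sleep(poll_interval)
```

With a real driver this could be called as elems = wait_for_links(lambda: driver.find_elements(By.XPATH, "//a[@href]")), which returns as soon as at least one link is found instead of always waiting the full 7 seconds.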

If you still face any issues, please refer to the link below (explained with an example) or leave a comment.

Extract links from webpage using selenium webdriver

Carbrey answered 12/6, 2021 at 15:28 Comment(2)
I think the hint to the sleep command is helpful otherwise it is redundant to the accepted answer.Idelson
the Sleep command is completely relevant. Without it, you can't pick any href attributes because there was no time to load it. Upvoted this solution!Halberd
E
3

You can try something like:

    links = driver.find_elements_by_partial_link_text('')
Equally answered 31/8, 2017 at 11:44 Comment(1)
The link texts are not the same; they are all different.Abolish
C
3

All of the answers that use Selenium's driver.find_elements_by_*** methods no longer work with current Selenium 4 releases (the methods were removed in 4.3.0). The current method is to use find_elements() with the By class.

Method 1: For loop

The code below builds two lists, one using By.XPATH and the other using By.TAG_NAME. Either one is enough; you do not need both.

By.XPATH is, IMO, the easiest, because "//a[@href]" matches only <a> tags that have an href, so it never yields the seemingly useless None value that By.TAG_NAME returns for <a> tags without one. The code also removes duplicates.

from selenium.webdriver.common.by import By

driver.get("https://www.amazon.com/")

href_links = []
href_links2 = []

elems = driver.find_elements(by=By.XPATH, value="//a[@href]")
elems2 = driver.find_elements(by=By.TAG_NAME, value="a")

for elem in elems:
    l = elem.get_attribute("href")
    if l not in href_links:
        href_links.append(l)

for elem in elems2:
    l = elem.get_attribute("href")
    if l not in href_links2 and l is not None:
        href_links2.append(l)

print(len(href_links))  # 360
print(len(href_links2))  # 360

print(href_links == href_links2)  # True

Method 2: List Comprehension

If duplicates are OK, a one-line list comprehension can be used.

from selenium.webdriver.common.by import By

driver.get("https://www.amazon.com/")

elems = driver.find_elements(by=By.XPATH, value="//a[@href]")
href_links = [e.get_attribute("href") for e in elems]

elems2 = driver.find_elements(by=By.TAG_NAME, value="a")
# href_links2 = [e.get_attribute("href") for e in elems2]  # Does not remove None values
href_links2 = [e.get_attribute("href") for e in elems2 if e.get_attribute("href") is not None]

print(len(href_links))  # 387
print(len(href_links2))  # 387

print(href_links == href_links2)  # True
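
The "not in" check in Method 1 rescans the result list for every element. A possible alternative, assuming the href values have already been pulled into a plain list of strings, is dict.fromkeys, which drops duplicates while keeping first-seen order. The helper name unique_hrefs and the sample URLs are illustrative only.

```python
def unique_hrefs(hrefs):
    """Drop None values and duplicates while keeping first-seen order."""
    return list(dict.fromkeys(h for h in hrefs if h is not None))


sample = [
    "https://example.com/a",
    None,
    "https://example.com/b",
    "https://example.com/a",
]
print(unique_hrefs(sample))
# ['https://example.com/a', 'https://example.com/b']
```

Combined with the list comprehension above, this would be unique_hrefs(e.get_attribute("href") for e in elems), and it works for both the By.XPATH and the By.TAG_NAME variant because None values are filtered out.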
Cockadoodledoo answered 9/8, 2022 at 11:51 Comment(0)
B
2

You can load the HTML DOM using the htmldom library for Python. You can find it here and install it using pip:

https://pypi.python.org/pypi/htmldom/2.0

from htmldom import htmldom
dom = htmldom.HtmlDom("https://www.github.com/")  
dom = dom.createDom()

The above code creates an HtmlDom object. HtmlDom takes a parameter, the URL of the page. Once the object is created, you need to call its createDom method. This parses the HTML data and constructs the parse tree, which can then be used for searching and manipulating the HTML data. The only restriction the library imposes is that the data, whether HTML or XML, must have a root element.

You can query the elements using the "find" method of HtmlDom object:

p_links = dom.find("a")  
for link in p_links:
  print ("URL: " +link.attr("href"))

The above code prints all the links/URLs present on the web page.

Bonsai answered 21/2, 2017 at 13:9 Comment(0)
T
2

Unfortunately, the original link posted by OP is dead...

If you're looking for a way to scrape links on a page, here's how you can scrape all of the "Hot Network Questions" links on this page with gazpacho:

from gazpacho import Soup

url = "https://mcmap.net/q/337431/-fetch-all-href-link-using-selenium-in-python/3731467"

soup = Soup.get(url)
a_tags = soup.find("div", {"id": "hot-network-questions"}).find("a")

[a.attrs["href"] for a in a_tags]
Tachograph answered 10/10, 2020 at 0:40 Comment(0)
B
1

You can do this easily and efficiently with BeautifulSoup. I have tested the code below and it worked fine for this purpose.

After this line -

driver.get("http://psychoticelites.com/")

use the below code -

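import requests
from bs4 import BeautifulSoup

# Re-download the page the driver is currently on and parse it
response = requests.get(driver.current_url)
soup = BeautifulSoup(response.content, 'html.parser')
for link in soup.find_all('a'):
    if link.get('href'):
        print(link.get("href"))
        print('\n')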
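
One difference worth knowing: a parser working on the raw HTML returns href values exactly as written in the markup, so relative links such as /about stay relative, whereas Selenium's get_attribute("href") generally gives back the resolved absolute URL. The sketch below resolves relative links with urllib.parse.urljoin. It uses the standard library's html.parser instead of BeautifulSoup purely to stay dependency-free; LinkCollector, the sample HTML and the base URL are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against base_url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip <a> tags without an href or with an empty one
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


html = '<a href="/about">About</a> <a name="top">Top</a> <a href="https://example.org/x">X</a>'
collector = LinkCollector("https://example.com/")
collector.feed(html)
print(collector.links)
# ['https://example.com/about', 'https://example.org/x']
```

With Selenium, driver.page_source and driver.current_url could be passed in place of the sample HTML and base URL, which also avoids downloading the page a second time.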
Brush answered 26/6, 2021 at 10:25 Comment(0)
C
1

For 2023:

from selenium.webdriver.common.by import By

url = "https://example.com"
driver.get(url)
raw_links = driver.find_elements(By.XPATH, '//a[@href]')
for link in raw_links:
    l = link.get_attribute("href")
    print("raw_link:{}".format(l))
Carothers answered 17/4, 2023 at 15:3 Comment(0)
S
0
import requests
import bs4
from selenium import webdriver

driver = webdriver.Chrome(r'C:\chromedrivers\chromedriver')  # enter the path (the driver is not used below)
data = requests.request('get', 'https://google.co.in/')  # any website
s = bs4.BeautifulSoup(data.text, 'html.parser')
for link in s.find_all('a'):
    print(link.get('href'))  # print the href value rather than the whole tag
Somatology answered 1/8, 2019 at 11:46 Comment(0)
B
0

Update to the existing accepted answer: for the current version (Selenium 4.3.0 and later), it needs to be:

from selenium.webdriver.common.by import By

elems = driver.find_elements(By.XPATH, "//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))
Bankhead answered 5/7, 2022 at 8:42 Comment(0)
