Python/Selenium web scraping: how to find the hidden href value of <a> links?

Scraping links should be a simple feat, usually just a matter of grabbing the href value of the <a> tag.

I recently came across this website (https://sunteccity.com.sg/promotions) where the href value of each item's <a> tag cannot be found, but the redirection still works. I'm trying to figure out a way to grab the items and their corresponding links. My typical Python Selenium code looks something like this:

all_items = bot.find_elements_by_class_name('thumb-img')
for promo in all_items:
    a = promo.find_elements_by_tag_name("a")
    print("a[0]: ", a[0].get_attribute("href"))

However, I can't seem to retrieve any href or onclick attributes, and I'm wondering if this is even possible. I also noticed that I couldn't right-click and open the link in a new tab.

Are there any ways around getting the links of all these items?

Edit: Are there any ways to retrieve all the links of the items on the pages?

i.e.

https://sunteccity.com.sg/promotions/724
https://sunteccity.com.sg/promotions/731
https://sunteccity.com.sg/promotions/751
https://sunteccity.com.sg/promotions/752
https://sunteccity.com.sg/promotions/754
https://sunteccity.com.sg/promotions/280
...

Edit: Adding an image of one such anchor tag for better clarity: [screenshot of the anchor tag]

Encode answered 15/1, 2022 at 12:19 Comment(6)
Update the question with the text-based HTML of those elements. - Responsum
Can you let us know which elements they are? - Triennial
document.querySelector('#__layout > div > div > main > div > div > div.collection-list.promotion-list.block-list > ul > li.first > div').click() will open the first promotion, which means there's no hidden href in the <a> tag; instead, the page is calling JavaScript. The <a> tag is misleading because it's probably there just to change the mouse pointer when hovering over the promotion. - Toshikotoss
Is there any way to retrieve the actual links of the items on the page? - Encode
@Encode Your updated screenshot shows an <a> tag with no href / onclick, whereas you mentioned retrieving href and onclick attributes. - Responsum
You will first need to find out whether any requests are made when you click on Suntec City, for example. So, open your browser, open your Dev Tools, go to the Network tab and click on Suntec City. Does the content of the Network tab change? If so, how? - Paisa
T
3

Reverse-engineering the JavaScript that takes you to the promotion pages (seen in https://sunteccity.com.sg/_nuxt/d4b648f.js) gives you a way to get all the links, which are built from each promotion's HappeningID. You can verify this by running the following in the JS console, which returns the HappeningID of the first promotion:

window.__NUXT__.state.Promotion.promotions[0].HappeningID

Based on that, you can create a Python loop to get all the promotions:

items = driver.execute_script("return window.__NUXT__.state.Promotion;")
for item in items["promotions"]:
    base = "https://sunteccity.com.sg/promotions/"
    happening_id = str(item["HappeningID"])
    print(base + happening_id)

That generated the following output:

https://sunteccity.com.sg/promotions/724
https://sunteccity.com.sg/promotions/731
https://sunteccity.com.sg/promotions/751
https://sunteccity.com.sg/promotions/752
https://sunteccity.com.sg/promotions/754
https://sunteccity.com.sg/promotions/280
https://sunteccity.com.sg/promotions/764
https://sunteccity.com.sg/promotions/766
https://sunteccity.com.sg/promotions/762
https://sunteccity.com.sg/promotions/767
https://sunteccity.com.sg/promotions/732
https://sunteccity.com.sg/promotions/733
https://sunteccity.com.sg/promotions/735
https://sunteccity.com.sg/promotions/736
https://sunteccity.com.sg/promotions/737
https://sunteccity.com.sg/promotions/738
https://sunteccity.com.sg/promotions/739
https://sunteccity.com.sg/promotions/740
https://sunteccity.com.sg/promotions/741
https://sunteccity.com.sg/promotions/742
https://sunteccity.com.sg/promotions/743
https://sunteccity.com.sg/promotions/744
https://sunteccity.com.sg/promotions/745
https://sunteccity.com.sg/promotions/746
https://sunteccity.com.sg/promotions/747
https://sunteccity.com.sg/promotions/748
https://sunteccity.com.sg/promotions/749
https://sunteccity.com.sg/promotions/750
https://sunteccity.com.sg/promotions/753
https://sunteccity.com.sg/promotions/755
https://sunteccity.com.sg/promotions/756
https://sunteccity.com.sg/promotions/757
https://sunteccity.com.sg/promotions/758
https://sunteccity.com.sg/promotions/759
https://sunteccity.com.sg/promotions/760
https://sunteccity.com.sg/promotions/761
https://sunteccity.com.sg/promotions/763
https://sunteccity.com.sg/promotions/765
https://sunteccity.com.sg/promotions/730
https://sunteccity.com.sg/promotions/734
https://sunteccity.com.sg/promotions/623
Toshikotoss answered 15/1, 2022 at 20:29 Comment(3)
Just awesome... I did explore that JS file but couldn't get to the HappeningID you mentioned. Will check again tomorrow. - Overbear
Hey Michael, thanks for your answer. I'm curious whether you have tried it in headless browser mode, options.add_argument('headless'), and whether it still works for you. It seems to return the links only about half the time (I tried running it about 10 times). - Encode
Hi Max, it's working for me in headless mode too. Maybe you need a delay of a few seconds between the page load and the loop. I'm running everything with the SeleniumBase framework, in case there's a difference. The SeleniumBase headless mode adds other command-line options to make things run cleaner, and to avoid bot-detection. - Toshikotoss
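
The delay suggested in the comment above could also be written as an explicit poll instead of a fixed sleep. The following is only a sketch under assumptions: the helper names wait_for_promotions and promotion_urls, the timeout and poll values, and the guard around window.__NUXT__ are illustrative and not from the original answer. It relies on nothing beyond the driver's execute_script method, and it assumes the page exposes window.__NUXT__.state.Promotion.promotions as described in the answer.

```python
import time

BASE_URL = "https://sunteccity.com.sg/promotions/"

# JavaScript that returns the promotions list, or null if the Nuxt state
# has not been populated yet (the guard is an assumption for illustration).
PROMOTIONS_SCRIPT = (
    "const s = window.__NUXT__ && window.__NUXT__.state;"
    "return s && s.Promotion ? s.Promotion.promotions : null;"
)


def wait_for_promotions(driver, timeout=10.0, poll=0.5):
    """Poll the page until the promotions list is non-empty, or time out."""
    deadline = time.monotonic() + timeout
    while True:
        promotions = driver.execute_script(PROMOTIONS_SCRIPT)
        if promotions:
            return promotions
        if time.monotonic() >= deadline:
            raise TimeoutError("promotions were not available in time")
        time.sleep(poll)


def promotion_urls(promotions, base=BASE_URL):
    """Build one URL per promotion from its HappeningID."""
    return [base + str(item["HappeningID"]) for item in promotions]


# Usage with a real Selenium driver (not run here):
# driver.get("https://sunteccity.com.sg/promotions")
# for url in promotion_urls(wait_for_promotions(driver)):
#     print(url)
```

Polling like this returns as soon as the data is present, so it may be less flaky in headless mode than a fixed delay, though whether it resolves the intermittent results reported above has not been verified.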
F
0

You are using the wrong locator; it matches a lot of irrelevant elements.
Instead of find_elements_by_class_name('thumb-img'), please try find_elements_by_css_selector('.collections-page .thumb-img'), so your code will be:

all_items = bot.find_elements_by_css_selector('.collections-page .thumb-img')
for promo in all_items:
    a = promo.find_elements_by_tag_name("a")
    print("a[0]: ", a[0].get_attribute("href"))

You can also target the desired links directly with the locator .collections-page .thumb-img a, so your code could be:

links = bot.find_elements_by_css_selector('.collections-page .thumb-img a')
for link in links:
    print(link.get_attribute("href"))
Fluorosis answered 15/1, 2022 at 19:47 Comment(2)
I don't think this returns the results I am looking for, because there is no href attribute in the <a> tags... - Encode
Well... I'm sorry, I see now. There are no links inside the web elements; they contain only the images. I see Michael's solution above. It's interesting, but it's done by reverse-engineering the JavaScript rather than with Selenium locators. It looks like the links are generated by JavaScript only after clicking on the elements. - Fluorosis
