How to retrieve the values of dynamic html content using Python
I'm using Python 3 and I'm trying to retrieve data from a website. However, this data is dynamically loaded and the code I have right now doesn't work:

from urllib import request

# eveCentralBaseURL and mineral are defined earlier in the program
url = eveCentralBaseURL + str(mineral)
print("URL : %s" % url)

response = request.urlopen(url)
data = str(response.read(10000))

data = data.replace("\\n", "\n")
print(data)

Where I expect to find a particular value, I find the template placeholder instead, e.g. "{{formatPrice median}}" rather than "4.48".

How can I make it so that I can retrieve the value instead of the placeholder text?

Edit: This is the specific page I'm trying to extract information from. I'm trying to get the "median" value, which uses the template {{formatPrice median}}.

Edit 2: I've installed and set up my program to use Selenium and BeautifulSoup.

The code I have now is:

from bs4 import BeautifulSoup
from selenium import webdriver

#...

driver = webdriver.Firefox()
driver.get(url)

html = driver.page_source
soup = BeautifulSoup(html, "html.parser")

print("Finding...")

for tag in soup.find_all('formatPrice median'):
    print(tag.text)

Here is a screenshot of the program as it's executing. Unfortunately, it doesn't seem to be finding anything with "formatPrice median" specified.

Quincuncial answered 11/7, 2013 at 15:24 Comment(3)
Do you get the template tags when you visit the URL in the browser? EDIT: Also, how are your templates rendered? If you are using a JavaScript template engine (e.g. Handlebars), this probably means you will get the template tags in the response. – Cyclo
RE edit 2 - this is just about a new question... anyway, I think you need to have a look at the documentation for find_all, as your find_all string is not valid. I'll update below with something a bit closer to what you need: crummy.com/software/BeautifulSoup/bs3/…. – Cyclo
Cheers! I tried using soup.find_all(True) to just get all the tags, and the information I need is in there! It'll just be a matter of finding exactly which tag I need to search to get that information. – Quincuncial
Assuming you are trying to get values from a page that is rendered using JavaScript templates (for instance something like Handlebars), then this is what you will get with any of the standard solutions (e.g. BeautifulSoup or requests).

This is because the browser uses JavaScript to alter what it received and create new DOM elements. urllib will do the requesting part like a browser, but not the template rendering part. A good description of the issues can be found here. The article discusses three main solutions:

  1. parse the AJAX JSON directly (a sketch follows this list)
  2. use an offline JavaScript interpreter to process the request (SpiderMonkey, Crowbar)
  3. use a browser automation tool (Splinter)
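
A minimal sketch of option 1, assuming the page fills its templates from a JSON endpoint: watch the Network tab in your browser's dev tools while the page loads, find the XHR request, and fetch that URL directly. The endpoint URL and the "median" key below are hypothetical stand-ins for whatever the real request uses.

import json
from urllib import request

# Hypothetical endpoint - replace with the XHR URL seen in dev tools
url = "http://example.com/api/marketstat?typeid=34"
with request.urlopen(url) as response:
    payload = json.loads(response.read().decode("utf-8"))

# The key name is an assumption; inspect the actual payload first
print(payload.get("median"))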

This answer provides a few more suggestions for option 3, such as Selenium or Watir. I've used Selenium for automated web testing and it's pretty handy.


EDIT

From your comments it looks like it is a Handlebars-driven site. I'd recommend Selenium and BeautifulSoup. This answer gives a good code example which may be useful:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://eve-central.com/home/quicklook.html?typeid=34')

html = driver.page_source
soup = BeautifulSoup(html, "html.parser")

# check out the docs for the kinds of things you can do with 'find_all'
# this (untested) snippet should find tags with a specific class ID
# see: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
for tag in soup.find_all("a", class_="my_class"):
    print(tag.text)

Basically, Selenium gets the rendered HTML from your browser, and you can then parse it with BeautifulSoup via the page_source property.
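
One caveat: page_source reflects whatever has been rendered at that moment, so if you read it too early you may still see "{{formatPrice median}}". A short sketch (untested; the CSS selector is a hypothetical placeholder) that waits for the rendered element before reading page_source:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('http://eve-central.com/home/quicklook.html?typeid=34')

# wait up to 10 seconds for an element matching the (hypothetical) selector
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "td.median"))
)
html = driver.page_source

Good luck :)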

Cyclo answered 11/7, 2013 at 17:35 Comment(5)
Thanks for the help. I have very little experience with web languages or web-based programming, but I'll link the site I'm trying to parse data from if that helps. – Quincuncial
I'll start looking into requests and BeautifulSoup too. – Quincuncial
I've had a look at the site - it nearly broke my computer a few times loading :) Yep, if you are in Chrome, hit F12 and go to the "Network" tab; you will see Backbone, Underscore, and Handlebars are all loaded. I think you will have to go the Selenium approach. I'll edit with some sample code. – Cyclo
Thanks again. I've tried what you've recommended and updated my post. :) – Quincuncial
What is the best solution for use on a server? Is it good practice to use Selenium on a server (not a local machine)? @Cyclo – Lacustrine
I used Selenium + Chrome:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = "http://www.sitetotarget.com"
options = Options()
options.add_argument('--headless')              # run Chrome without a visible window
options.add_argument('--disable-gpu')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

driver = webdriver.Chrome(options=options)
driver.get(url)
html = driver.page_source
driver.quit()
Leanto answered 15/11, 2020 at 7:0 Comment(0)
Building off another answer: I had a similar issue. wget and curl no longer work well for retrieving the content of a web page; they break in particular on dynamic and lazily loaded content. Using Chrome (or Firefox, or the Chromium version of Edge) lets you deal with redirects and scripting.

The code below launches an instance of Chrome, increases the page-load timeout to 5 seconds, and sizes the browser window; navigating to a URL happens in the next snippet. I ran this from Jupyter.

import time
from tqdm.notebook import tqdm
from PIL import Image
from selenium import webdriver

driver = webdriver.Chrome('/usr/bin/chromedriver')
driver.set_page_load_timeout(5)
time.sleep(1)
driver.set_window_size(2100, 9000)
time.sleep(1)
driver.set_window_size(2100, 9000)
## You can manually adjust the browser, but don't move it after this.
## Do stuff ...
driver.quit()

Example of grabbing dynamic content and taking screenshots of anchor elements (the "a" tag, i.e. hyperlinks):

url = 'http://www.example.org'  ## Any website
driver.get(url)

pageSource = driver.page_source
print(driver.get_window_size())

locations = []

for element in driver.find_elements_by_tag_name("a"):
    location = element.location
    size = element.size
    # Collect coordinates of the object: left/right, top/bottom
    x1 = location['x']
    y1 = location['y']
    x2 = location['x'] + size['width']
    y2 = location['y'] + size['height']
    locations.append([element, x1, y1, x2, y2, x2 - x1, y2 - y1])

## Sort by descending width + height so the largest links come first
locations.sort(key=lambda x: -x[-2] - x[-1])
locations = [(el, x1, y1, x2, y2, width, height)
    for el, x1, y1, x2, y2, width, height in locations
    if not (
            ## First, filter links that are not visible (located offscreen or zero pixels in any dimension)
            x2 <= x1 or y2 <= y1 or x2 < 0 or y2 < 0
            ## Further restrict if you expect the objects to be around a specific size
            ## or width<200 or height<100
           )
]

for el, x1, y1, x2, y2, width, height in tqdm(locations[:10]):
    try:
        print('-' * 100, f'({width},{height})')
        print(el.text[:100])
        element_png = el.screenshot_as_png
        with open('/tmp/_pageImage.png', 'wb') as f:
            f.write(element_png)
        img = Image.open('/tmp/_pageImage.png')
        display(img)  ## IPython's display, available in Jupyter
    except Exception as err:
        print(err)


Installation for Mac + Chrome (on newer Homebrew versions, brew cask install has been replaced by brew install --cask):

pip install selenium
brew cask install chromedriver
brew cask install google-chrome

I used a Mac for the original answer, and later Ubuntu + the Windows 11 preview via WSL2 after updating. Chrome ran from the Linux side, with an X service on Windows rendering the UI.

Regarding responsibility, please respect robots.txt on each site.
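
A minimal sketch of that check using only the standard library (the URL is a placeholder):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.org/robots.txt")
rp.read()

# True if the rules allow an arbitrary user agent to fetch this page
print(rp.can_fetch("*", "http://www.example.org/some/page.html"))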

Chinachinaberry answered 21/11, 2020 at 1:3 Comment(0)
I know this is an old question, but sometimes there is a better solution than using heavyweight Selenium.

This requests module for Python comes with JS support (in the background it is still Chromium) and you can still use BeautifulSoup as normal. Though, if you have to click elements or something similar, I guess Selenium is the only option.
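
The original link is gone, but the description matches the requests-html package (an assumption on my part). A minimal sketch of typical usage; the URL and CSS selector are placeholders:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get("http://www.example.org")
r.html.render()  # downloads Chromium on first use and runs the page's JS

# hypothetical selector for the value you want
element = r.html.find("span.median", first=True)
if element is not None:
    print(element.text)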

Sastruga answered 3/8, 2021 at 14:37 Comment(0)
