Trouble getting the trade price using the "Requests-HTML" library

I've written a script in python to get the price of the last trade from a javascript-rendered webpage. I can get the content if I choose to go with selenium. My goal here is not to use any browser simulator like selenium, because the latest release of Requests-HTML is supposed to be able to parse javascript-rendered content. However, I haven't been able to make it work. When I run the script, I get the following error. Any help on this will be highly appreciated.

Site address: https://www.gdax.com/trade/LTC-EUR

The script I've tried with:

import requests_html

with requests_html.HTMLSession() as session:
    r = session.get('https://www.gdax.com/trade/LTC-EUR')
    js = r.html.render()
    item = js.find('.MarketInfo_market-num_1lAXs',first=True).text
    print(item)

This is the complete traceback:

Exception in callback NavigatorWatcher.waitForNavigation.<locals>.watchdog_cb(<Task finishe...> result=None>) at C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py:49
handle: <Handle NavigatorWatcher.waitForNavigation.<locals>.watchdog_cb(<Task finishe...> result=None>) at C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py:49>
Traceback (most recent call last):
  File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\asyncio\events.py", line 145, in _run
    self._callback(*self._args)
  File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py", line 52, in watchdog_cb
    self._timeout)
  File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py", line 40, in _raise_error
    raise error
concurrent.futures._base.TimeoutError: Navigation Timeout Exceeded: 3000 ms exceeded
Traceback (most recent call last):
  File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\experiment.py", line 6, in <module>
    item = js.find('.MarketInfo_market-num_1lAXs',first=True).text
AttributeError: 'NoneType' object has no attribute 'find'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\shutil.py", line 387, in _rmtree_unsafe
    os.unlink(fullname)
PermissionError: [WinError 5] Access is denied: 'C:\\Users\\ar\\.pyppeteer\\.dev_profile\\tmp1gng46sw\\CrashpadMetrics-active.pma'

The price I'm after is available at the top of the page, displayed like this: 177.59 EUR Last trade price. I wish to get 177.59 or whatever the current price is.

Ammonify answered 28/2, 2018 at 7:12 Comment(2)
Could it be due to the render function not returning the result object, but the result object still being r? What happens if you do r.html.search(...)?Profitsharing
No improvement there. I tried with .search() and got this error: AttributeError: 'NoneType' object has no attribute 'search'.Ammonify

You have several errors. The first is a 'navigation' timeout, showing that the page didn’t complete rendering:

Exception in callback NavigatorWatcher.waitForNavigation.<locals>.watchdog_cb(<Task finishe...> result=None>) at C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py:49
handle: <Handle NavigatorWatcher.waitForNavigation.<locals>.watchdog_cb(<Task finishe...> result=None>) at C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py:49>
Traceback (most recent call last):
  File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\asyncio\events.py", line 145, in _run
    self._callback(*self._args)
  File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py", line 52, in watchdog_cb
    self._timeout)
  File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py", line 40, in _raise_error
    raise error
concurrent.futures._base.TimeoutError: Navigation Timeout Exceeded: 3000 ms exceeded

This traceback is not raised in the main thread, and your code was not aborted because of it. Your page may or may not be complete; you may want to set a longer timeout or introduce a sleep cycle to give the browser time to process AJAX responses.

Next, the response.html.render() method returns None. It loads the HTML into a headless Chromium browser, leaves JavaScript rendering to that browser, then copies the rendered page HTML back into the response.html data structure in place, so nothing needs to be returned. That means js is set to None, not a new HTML instance, which causes your next traceback.

Use the existing response.html object to search, after rendering:

r.html.render()
item = r.html.find('.MarketInfo_market-num_1lAXs', first=True)

There is most likely no such CSS class, because the last 5 characters are generated on each page render, after JSON data is loaded over AJAX. This makes it hard to use CSS to find the element in question.
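One workaround, which the asker confirmed works in the comments below, is a CSS attribute prefix selector that matches only the stable part of the class name:

# match any element whose class starts with the stable prefix;
# the 5-character suffix changes between site builds
item = r.html.find("[class^='MarketInfo_market-num_']", first=True)
if item is not None:
    print(item.text)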

Moreover, I found that without a sleep cycle, the browser has no time to fetch AJAX resources and render the information you wanted to load. Give it, say, 10 seconds of sleep to do some work before copying back the HTML. Set a longer timeout (the default is 8 seconds) if you see network timeouts:

r.html.render(timeout=10, sleep=10)

You could set the timeout to 0 too, to remove the timeout and just wait indefinitely until the page has loaded.
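For example:

r.html.render(timeout=0, sleep=10)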

Hopefully a future API update also provides features to wait for network activity to cease.

You can use the included parse library to find the matching CSS classes:

# search for CSS suffixes
suffixes = [r[0] for r in r.html.search_all('MarketInfo_market-num_{:w}')]
for suffix in suffixes:
    # for each suffix, find all matching elements with that class
    items = r.html.find('.MarketInfo_market-num_{}'.format(suffix))
    for item in items:
        print(item.text)

Now we get this output:

169.81 EUR
+
1.01 %
18,420 LTC
169.81 EUR
+
1.01 %
18,420 LTC
169.81 EUR
+
1.01 %
18,420 LTC
169.81 EUR
+
1.01 %
18,420 LTC

Your last traceback shows that the Chromium user data path could not be cleaned up. The underlying Pyppeteer library configures the headless Chromium browser with a temporary user data path, and in your case the directory contains some still-locked resource. You can ignore the error, although you may want to try and remove any remaining files in the .pyppeteer folder at a later time.
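If the leftover profiles bother you, a minimal cleanup sketch (assuming the default pyppeteer profile location shown in your traceback, and run while no render is in progress) would be:

import shutil
from pathlib import Path

# remove leftover temporary Chromium profiles; ignore_errors skips
# any file a lingering Chromium process still has locked
dev_profile = Path.home() / '.pyppeteer' / '.dev_profile'
shutil.rmtree(dev_profile, ignore_errors=True)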

Ichthyoid answered 3/3, 2018 at 22:18 Comment(4)
You are very right about adding some delay and fixing the css selector. The selector used in the above script is faulty because it is generated dynamically. I tried with [class^='MarketInfo_market-num_'] and r.html.render(sleep=10), and I'm getting the result now. The errors are still there along with the desired result. Is there any way I can get rid of the errors?Isthmus
@SIM: if you see TimeoutErrors, set a larger timeout= value when rendering. Or set it to 0 to remove the timeout altogether.Ichthyoid
r.html.render(timeout=10, sleep=10) in my case starts a new session and parses the login page again. Any idea why?Gonzalez
@Pritish: nope, sorry.Ichthyoid

Do you need it to go through Requests-HTML? On the day you posted, the repo was 4 days old and in the 3 days that have passed there have been 50 commits. It's not going to be completely stable for some time.

See here: https://github.com/kennethreitz/requests-html/graphs/commit-activity

OTOH, there is an API for gdax.

https://docs.gdax.com/#market-data

Now, if you're dead set on using Py3, there is a python client listed on the GDAX website. Upfront I'll mention that it's an unofficial client; however, using it you'd be able to quickly and easily get responses from the official GDAX API.

https://github.com/danpaquin/gdax-python
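For instance, here is a minimal sketch using plain requests against the public market-data ticker route from those docs (assuming the LTC-EUR product id from your URL):

import requests

# public GDAX ticker endpoint; no authentication required
resp = requests.get('https://api.gdax.com/products/LTC-EUR/ticker')
resp.raise_for_status()
print(resp.json()['price'])  # e.g. '177.59'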

Grishilda answered 3/3, 2018 at 12:3 Comment(1)
@shayan: then pick a site that doesn't cause issues right now. By the looks of it, the page just failed to load extra resources.Ichthyoid

In case you want to go another way, here's a Selenium-based approach:

from selenium import webdriver

chrome_path = r"C:\Users\Mike\Desktop\chromedriver.exe"

driver = webdriver.Chrome(chrome_path)
driver.get("https://www.gdax.com/trade/LTC-EUR")

# the generated class name from the page source at the time of writing
item = driver.find_element_by_xpath("//span[@class='MarketInfo_market-num_1lAXs']")
print(item.text)
driver.close()

Result: 177.60 EUR
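Since the price is rendered by JavaScript, a more robust variant would use an explicit wait; this sketch is my own assumption rather than part of the original answer, and it reuses the prefix selector idea from above to avoid the generated class-name suffix:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(r"C:\Users\Mike\Desktop\chromedriver.exe")
driver.get("https://www.gdax.com/trade/LTC-EUR")
try:
    # wait up to 10 seconds for the JS-rendered price element to appear
    item = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located(
            (By.CSS_SELECTOR, "span[class^='MarketInfo_market-num_']")))
    print(item.text)
finally:
    driver.close()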

Doubletree answered 28/2, 2018 at 7:42 Comment(0)