Mechanize and Python, clicking href="javascript:void(0);" links and getting the response back

I need to scrape some data from a page where I fill out a form (I already did this with mechanize). The problem is that the results are spread across many pages, and I'm having trouble getting the data from the pages after the first one.

Getting the first result page is no problem, since it is displayed right after the search: I simply submit the form and read the response.

I analyzed the source code of the results page and it seems to use JavaScript and RichFaces (a JSF library with AJAX support, but I may be wrong since I am not a web expert).

However, I did manage to figure out how to get to the remaining result pages: I need to click pagination links of this form (href="javascript:void(0);", full markup below):

<td class="pageNumber"><span class="rf-ds " id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233"><span class="rf-ds-nmb-btn rf-ds-act " id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_1">1</span><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_2">2</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_3">3</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_4">4</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_5">5</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_6">6</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_7">7</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_8">8</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_9">9</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_10">10</a><a class="rf-ds-btn rf-ds-btn-next" href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_next">»</a><a class="rf-ds-btn rf-ds-btn-last" href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_l">»»»»</a>

<script type="text/javascript">new RichFaces.ui.DataScroller("SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233",function(event,element,data){RichFaces.ajax("SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233",event,{"parameters":{"SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233:page":data.page} ,"incId":"1"} )},{"digitals":{"SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_9":"9","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_8":"8","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_7":"7","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_6":"6","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_5":"5","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_4":"4","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_3":"3","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_1":"1","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_10":"10","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_2":"2"} ,"buttons":{"right":{"SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_next":"next","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_l":"last"} } ,"currentPage":1} )</script></span></td>
<td class="pageExport"><script type="text/javascript" src="/opi/javax.faces.resource/download.js?ln=js/component&amp;b="></script><script type="text/javascript">

So my question is: is there a way to "click" all of these links and fetch all the result pages using mechanize (note that after the » symbol there are more pages available)? Please assume I know next to nothing about web technology :)
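From the script above it looks like clicking a link just calls RichFaces.ajax(...) with the target page number in the ...j_idt233:page parameter, so I imagine it boils down to some POST request. Something like the completely unverified sketch below is what I have in mind, but I don't know whether the parameter names are right; they would have to be copied from the browser's network tab:

# completely unverified sketch: try to replay the RichFaces AJAX POST without JavaScript;
# the javax.faces.* parameter names are guesses based on standard JSF behaviour
import urllib
import mechanize

br = mechanize.Browser()
br.open('https://example.com/results')  # hypothetical: the URL of the results page
br.select_form(name='SomeSimpleForm')

# start from the form's current fields (this should include javax.faces.ViewState)
data = dict((c.name, c.value) for c in br.form.controls
            if c.name and c.value is not None)
data.update({
    'javax.faces.partial.ajax': 'true',
    'javax.faces.source': 'SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233',
    'SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233:page': '2',  # page to fetch
})
response = br.open(br.form.action, urllib.urlencode(data))
print(response.read())  # hopefully a partial-response XML document with the new rows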

Eugenol answered 15/7, 2015 at 22:13 Comment(0)

First of all, I would still stick with selenium, since this is quite a "JavaScript-heavy" website. Note that you can run it headlessly (PhantomJS, or a regular browser inside a virtual display) if needed.
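For instance, a minimal sketch of the virtual-display option, assuming pyvirtualdisplay (and Xvfb) are installed:

from pyvirtualdisplay import Display
from selenium import webdriver

display = Display(visible=0, size=(1366, 768))  # invisible X display for the browser
display.start()

driver = webdriver.Firefox()  # Firefox now renders into the virtual display
# ... run the scraping code below ...
driver.quit()
display.stop()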

The idea is to switch the paginator to 100 rows per page and then click the "»" (next) link until it is no longer present on the page, which means we've hit the last page and there are no more results to process. To make the solution reliable we need to use Explicit Waits: every time we proceed to the next page, we wait for the loading spinner to become invisible.

Working implementation:

# -*- coding: utf-8 -*-
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium import webdriver
from selenium.webdriver.support.select import Select
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.maximize_window()

driver.get('https://polon.nauka.gov.pl/opi/aa/drh/zestawienie?execution=e1s1')
wait = WebDriverWait(driver, 30)

# paginate by 100
select = Select(driver.find_element_by_id("drhPageForm:drhPageTable:j_idt211:j_idt214:j_idt220"))
select.select_by_visible_text("100")

while True:
    # wait until there is no loading spinner
    wait.until(EC.invisibility_of_element_located((By.ID, "loadingPopup_content_scroller")))

    current_page = driver.find_element_by_class_name("rf-ds-act").text
    print("Current page: %d" % current_page)

    # TODO: collect the results

    # proceed to the next page
    try:
        next_page = driver.find_element_by_link_text(u"»")
        next_page.click()
    except NoSuchElementException:
        break
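The # TODO above is where the rows of the results table would be collected on every iteration. A rough sketch that could be dropped in its place, assuming the results table has the id drhPageForm:drhPageTable (a guess based on the other ids on the page; verify it in the page source):

    # hypothetical selector: adjust the id to the actual results table
    rows = driver.find_elements_by_css_selector("[id='drhPageForm:drhPageTable'] tbody tr")
    for row in rows:
        print([cell.text for cell in row.find_elements_by_tag_name("td")])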
Resilience answered 18/7, 2015 at 13:23 Comment(2)
It seems your solution is better. I opened a new bounty to thank you for your answer :)Eugenol
@Eugenol wow, thanks so much for it. Glad the answer helped to solve the problem.Resilience

This works for me; it seems all the HTML is available in the page source:

import time
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('https://polon.nauka.gov.pl/opi/aa/drh/zestawienie')

# id of the "next page" link rendered by the RichFaces data scroller
next_id = 'drhPageForm:drhPageTable:j_idt211:j_idt233_ds_next'

pages = []
it = 0
while it < 1795:  # number of result pages to step through
    time.sleep(1)  # be gentle with the server
    it += 1
    bad = True
    while bad:
        try:
            # the click occasionally fails while the page is still loading,
            # so keep retrying until it succeeds
            driver.find_element_by_id(next_id).click()
            bad = False
        except Exception:
            print('retry')

    # store the full HTML of the current result page for later parsing
    page = driver.page_source
    pages.append(page)

Instead of collecting and storing all the HTML first, you could also query just what you want on each page, but you'll need lxml or BeautifulSoup for that.
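For example, a rough sketch with BeautifulSoup; the tbody tr selector is a guess and would need to be adjusted to the actual results table:

from bs4 import BeautifulSoup  # pip install beautifulsoup4

records = []
for page in pages:
    soup = BeautifulSoup(page, 'html.parser')
    # hypothetical selector - point it at the real results table
    for row in soup.select('tbody tr'):
        records.append([cell.get_text(strip=True) for cell in row.find_all('td')])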

EDIT: After running it I did indeed hit an occasional error; it was simple to just catch the exception and retry.

Teddytedeschi answered 18/7, 2015 at 12:42 Comment(5)
Thank you so much for the help :) I will try it in a while. Yeah, I agree, but BeautifulSoup is not a problem, I've used it before, so I think I can handle that part. However, I had problems with the send_keys method, because after I clicked the Search (Wyszukaj) button programmatically, the page cleared the search criteria. Meh, who cares - if your approach works, I will simply use BS4 for parsing.Eugenol
Oh, I just noticed, you're THE GUY from yagmail - I've used your tool, and I just wanted to thank you for it, it's awesome!Eugenol
Good luck! Pretty sure it will work :) Indeed, it's unclear what exactly the page is doing, but simply retrying the click works... Also, if you want to be friendly to the page and be patient, feel free to add more delay.Teddytedeschi
@Eugenol Hah, so cool to be called "THE GUY"; you're very welcome!Teddytedeschi
Partly. I'm using your solution, but it seems that it somehow 'repeats' pages and downloads some of them twice. However, I don't think it's a huge problem; I can deal with it later, when parsing. Cheers :)Eugenol
