Mechanize and Python, clicking href="javascript:void(0);" links and getting the response back

I need to scrape some data from a page where I fill out a form (I already did this with mechanize). The problem is that the results are spread across many pages, and I'm having trouble getting the data from the pages after the first one.

Getting the first result page is no problem, since it is displayed right after the search: I simply submit the form and read the response.

I analyzed the source code of the results page and it seems to use JavaScript and RichFaces (a JSF library with AJAX support, but I may be wrong since I am not a web expert).

However, I did manage to figure out how to get to the remaining result pages: I need to click pagination links of this form (href="javascript:void(0);", full markup below):

<td class="pageNumber"><span class="rf-ds " id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233"><span class="rf-ds-nmb-btn rf-ds-act " id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_1">1</span><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_2">2</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_3">3</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_4">4</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_5">5</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_6">6</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_7">7</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_8">8</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_9">9</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_10">10</a><a class="rf-ds-btn rf-ds-btn-next" href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_next">»</a><a class="rf-ds-btn rf-ds-btn-last" href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_l">»»»»</a>

<script type="text/javascript">new RichFaces.ui.DataScroller("SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233",function(event,element,data){RichFaces.ajax("SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233",event,{"parameters":{"SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233:page":data.page} ,"incId":"1"} )},{"digitals":{"SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_9":"9","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_8":"8","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_7":"7","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_6":"6","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_5":"5","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_4":"4","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_3":"3","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_1":"1","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_10":"10","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_2":"2"} ,"buttons":{"right":{"SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_next":"next","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_l":"last"} } ,"currentPage":1} )</script></span></td>
<td class="pageExport"><script type="text/javascript" src="/opi/javax.faces.resource/download.js?ln=js/component&amp;b="></script><script type="text/javascript">

So my question is: is there a way to "click" all of these links and fetch all the result pages using mechanize (note that after the » symbol there are more pages available)? Please assume I know next to nothing about web technology :)
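From the script above it looks like clicking a link just calls RichFaces.ajax(...) with the target page number in the ...j_idt233:page parameter, so I imagine it boils down to some POST request. Something like the completely unverified sketch below is what I have in mind, but I don't know whether the parameter names are right; they would have to be copied from the browser's network tab:

# completely unverified sketch: try to replay the RichFaces AJAX POST without JavaScript;
# the javax.faces.* parameter names are guesses based on standard JSF behaviour
import urllib
import mechanize

br = mechanize.Browser()
br.open('https://example.com/results')  # hypothetical: the URL of the results page
br.select_form(name='SomeSimpleForm')

# start from the form's current fields (this should include javax.faces.ViewState)
data = dict((c.name, c.value) for c in br.form.controls
            if c.name and c.value is not None)
data.update({
    'javax.faces.partial.ajax': 'true',
    'javax.faces.source': 'SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233',
    'SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233:page': '2',  # page to fetch
})
response = br.open(br.form.action, urllib.urlencode(data))
print(response.read())  # hopefully a partial-response XML document with the new rows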

Eugenol answered 15/7, 2015 at 22:13 Comment(0)

First of all, I would still stick with selenium, since this is quite a "JavaScript-heavy" website. Note that you can run it headlessly (PhantomJS, or a regular browser inside a virtual display) if needed.
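For instance, a minimal sketch of the virtual-display option, assuming pyvirtualdisplay (and Xvfb) are installed:

from pyvirtualdisplay import Display
from selenium import webdriver

display = Display(visible=0, size=(1366, 768))  # invisible X display for the browser
display.start()

driver = webdriver.Firefox()  # Firefox now renders into the virtual display
# ... run the scraping code below ...
driver.quit()
display.stop()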

The idea is to switch the paginator to 100 rows per page and then click the "»" (next) link until it is no longer present on the page, which means we've hit the last page and there are no more results to process. To make the solution reliable we need to use Explicit Waits: every time we proceed to the next page, we wait for the loading spinner to become invisible.

Working implementation:

# -*- coding: utf-8 -*-
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium import webdriver
from selenium.webdriver.support.select import Select
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.maximize_window()

driver.get('https://polon.nauka.gov.pl/opi/aa/drh/zestawienie?execution=e1s1')
wait = WebDriverWait(driver, 30)

# paginate by 100
select = Select(driver.find_element_by_id("drhPageForm:drhPageTable:j_idt211:j_idt214:j_idt220"))
select.select_by_visible_text("100")

while True:
    # wait until there is no loading spinner
    wait.until(EC.invisibility_of_element_located((By.ID, "loadingPopup_content_scroller")))

    current_page = driver.find_element_by_class_name("rf-ds-act").text
    print("Current page: %d" % current_page)

    # TODO: collect the results

    # proceed to the next page
    try:
        next_page = driver.find_element_by_link_text(u"»")
        next_page.click()
    except NoSuchElementException:
        break
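The # TODO above is where the rows of the results table would be collected on every iteration. A rough sketch that could be dropped in its place, assuming the results table has the id drhPageForm:drhPageTable (a guess based on the other ids on the page; verify it in the page source):

    # hypothetical selector: adjust the id to the actual results table
    rows = driver.find_elements_by_css_selector("[id='drhPageForm:drhPageTable'] tbody tr")
    for row in rows:
        print([cell.text for cell in row.find_elements_by_tag_name("td")])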
Resilience answered 18/7, 2015 at 13:23 Comment(2)
It seems your solution is better. I opened a new bounty to thank you for your answer :)Eugenol
@Eugenol wow, thanks so much for it. Glad the answer helped to solve the problem.Resilience

This works for me; it seems all the HTML is available in the page source:

import time
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('https://polon.nauka.gov.pl/opi/aa/drh/zestawienie')

# id of the "next page" link rendered by the RichFaces data scroller
next_id = 'drhPageForm:drhPageTable:j_idt211:j_idt233_ds_next'

pages = []
it = 0
while it < 1795:  # number of result pages to step through
    time.sleep(1)  # be gentle with the server
    it += 1
    bad = True
    while bad:
        try:
            # the click occasionally fails while the page is still loading,
            # so keep retrying until it succeeds
            driver.find_element_by_id(next_id).click()
            bad = False
        except Exception:
            print('retry')

    # store the full HTML of the current result page for later parsing
    page = driver.page_source
    pages.append(page)

Instead of collecting and storing all the HTML first, you could also query just what you want on each page, but you'll need lxml or BeautifulSoup for that.
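For example, a rough sketch with BeautifulSoup; the tbody tr selector is a guess and would need to be adjusted to the actual results table:

from bs4 import BeautifulSoup  # pip install beautifulsoup4

records = []
for page in pages:
    soup = BeautifulSoup(page, 'html.parser')
    # hypothetical selector - point it at the real results table
    for row in soup.select('tbody tr'):
        records.append([cell.get_text(strip=True) for cell in row.find_all('td')])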

EDIT: After running it I did indeed hit an occasional error; it was simple to just catch the exception and retry.

Teddytedeschi answered 18/7, 2015 at 12:42 Comment(5)
Thank you so much for the help :) I will try it in a while. Yeah, I agree, but BeautifulSoup is not a problem, I've used it before, so I think I can handle that part. However, I had problems with the send_keys method, because after I clicked the Search (Wyszukaj) button programmatically, the page cleared the search criteria. Meh, who cares - if your approach works, I will simply use BS4 for parsing.Eugenol
Oh, I just noticed, you're THE GUY from yagmail - I've used your tool, and I just wanted to thank you for it, it's awesome!Eugenol
Good luck! Pretty sure it will work :) Indeed, it's unclear what exactly the page is doing, but simply retrying the click works... Also, if you want to be friendly to the page and be patient, feel free to add more delay.Teddytedeschi
@Eugenol Hah, so cool to be called "THE GUY"; you're very welcome!Teddytedeschi
Partly. I'm using your solution, but it seems that it somehow 'repeats' pages and downloads some of them twice. However, I don't think it's a huge problem; I can deal with it later, when parsing. Cheers :)Eugenol
