Scraping a site that hides link URLs behind __doPostBack
I am trying to scrape search results from a website that uses a __doPostBack function. The page displays 10 results per search query; to see more, one has to click a button that triggers a __doPostBack JavaScript call. After some research, I realized that the resulting POST request behaves just like a form submission, and that one can simply use Scrapy's FormRequest to fill that form. I used the following thread:

Troubles using scrapy with javascript __doPostBack method

to write the following script.

# -*- coding: utf-8 -*- 
from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import FormRequest
from scrapy.http import Request
from scrapy.selector import Selector
from ahram.items import AhramItem
import re

class MySpider(CrawlSpider):
    name = u"el_ahram2"

    def start_requests(self):
        search_term = u'اقتصاد'
        baseUrl = u'http://digital.ahram.org.eg/sresult.aspx?srch=' + search_term + u'&archid=1'
        requests = []
        for i in range(1, 4):  # crawl the first 3 pages as a test
            argument = u"'Page$" + str(i + 1) + u"'"
            data = {'__EVENTTARGET': u"'GridView1'", '__EVENTARGUMENT': argument}
            currentPage = FormRequest(baseUrl, formdata=data, callback=self.fetch_articles)
            requests.append(currentPage)
        return requests

    def fetch_articles(self, response):
        sel = Selector(response)
        for ref in sel.xpath("//a[contains(@href,'checkpart.aspx?Serial=')]/@href").extract(): 
            yield Request('http://digital.ahram.org.eg/' + ref, callback=self.parse_items)

    def parse_items(self, response):
        sel = Selector(response)
        the_title = ' '.join(sel.xpath("//title/text()").extract()).replace('\n', '').replace('\r', '').replace('\t', '')
        the_authors = '---'.join(sel.xpath("//*[contains(@id,'editorsdatalst_HyperLink')]//text()").extract())
        the_text = ' '.join(sel.xpath("//span[@id='TextBox2']/text()").extract())
        the_month_year = ' '.join(sel.xpath("string(//span[@id = 'Label1'])").extract())
        the_day = ' '.join(sel.xpath("string(//span[@id = 'Label2'])").extract())
        item = AhramItem()
        item["Authors"] = the_authors
        item["Title"] = the_title
        item["MonthYear"] = the_month_year
        item["Day"] = the_day
        item['Text'] = the_text
        return item

My problem now is that 'fetch_articles' is never called:

2014-05-27 12:19:12+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-05-27 12:19:13+0200 [el_ahram2] DEBUG: Crawled (200) <POST http://digital.ahram.org.eg/sresult.aspx?srch=%D8%A7%D9%82%D8%AA%D8%B5%D8%A7%D8%AF&archid=1> (referer: None)
2014-05-27 12:19:13+0200 [el_ahram2] DEBUG: Crawled (200) <POST http://digital.ahram.org.eg/sresult.aspx?srch=%D8%A7%D9%82%D8%AA%D8%B5%D8%A7%D8%AF&archid=1> (referer: None)
2014-05-27 12:19:13+0200 [el_ahram2] DEBUG: Crawled (200) <POST http://digital.ahram.org.eg/sresult.aspx?srch=%D8%A7%D9%82%D8%AA%D8%B5%D8%A7%D8%AF&archid=1> (referer: None)
2014-05-27 12:19:13+0200 [el_ahram2] INFO: Closing spider (finished)

After searching for several days I feel completely stuck. I am a beginner in Python, so perhaps the error is trivial; if it is not, this thread could be of use to a number of people. Thank you in advance for your help.

Quest answered 27/5, 2014 at 9:28 Comment(2)
Can you try deriving from BaseSpider rather than CrawlSpider? (Cabrales)
I just did and I get the same results. Perhaps I don't fully understand the difference between BaseSpider and CrawlSpider, but it's not clear to me that it should change anything. (Quest)

Your code is fine, and fetch_articles is being called; you can verify this by adding a print statement.

However, the website requires you to validate POST requests. To validate them, you must include __EVENTVALIDATION and __VIEWSTATE in the request body to prove you are responding to its form. To obtain these values, you first need to make a GET request and extract the fields from the form. If you don't provide them, you get an error page instead, which contains no links matching "checkpart.aspx?Serial=", so the body of your for loop never executes.

Here is how I've set up start_requests; fetch_search now does what start_requests used to do.

class MySpider(CrawlSpider):
    name = u"el_ahram2"

    def start_requests(self):
        search_term = u'اقتصاد'
        baseUrl = u'http://digital.ahram.org.eg/sresult.aspx?srch=' + search_term + u'&archid=1'
        SearchPage = Request(baseUrl, callback=self.fetch_search)
        return [SearchPage]

    def fetch_search(self, response):
        sel = Selector(response)
        search_term = u'اقتصاد'
        baseUrl = u'http://digital.ahram.org.eg/sresult.aspx?srch=' + search_term + u'&archid=1'
        viewstate = sel.xpath("//input[@id='__VIEWSTATE']/@value").extract().pop()
        eventvalidation = sel.xpath("//input[@id='__EVENTVALIDATION']/@value").extract().pop()
        for i in range(1, 4):  # crawl the first 3 pages as a test
            argument = u"'Page$" + str(i + 1) + u"'"
            data = {
                '__EVENTTARGET': u"'GridView1'",
                '__EVENTARGUMENT': argument,
                '__VIEWSTATE': viewstate,
                '__EVENTVALIDATION': eventvalidation,
            }
            currentPage = FormRequest(baseUrl, formdata=data, callback=self.fetch_articles)
            yield currentPage

    ...
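For readers who want to see what the hidden-field extraction amounts to without Scrapy, here is a minimal standard-library sketch (Python 3; the HTML snippet and field values below are made up for illustration, not taken from the actual site):

```python
# Minimal sketch: pull ASP.NET hidden form fields (e.g. __VIEWSTATE,
# __EVENTVALIDATION) out of an HTML page using only the standard library.
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collects name -> value pairs for all <input type="hidden"> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == 'input':
            d = dict(attrs)
            if d.get('type') == 'hidden' and 'name' in d:
                self.fields[d['name']] = d.get('value', '')


# Hypothetical page markup standing in for the GET response body.
html = '''<form method="post" action="sresult.aspx">
<input type="hidden" name="__VIEWSTATE" value="dummystate" />
<input type="hidden" name="__EVENTVALIDATION" value="dummyvalidation" />
</form>'''

parser = HiddenFieldParser()
parser.feed(html)
print(parser.fields)
```

The extracted dictionary can then be merged with __EVENTTARGET and __EVENTARGUMENT to build the POST body, which is exactly what the answer above does with Scrapy selectors and FormRequest.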
Operative answered 18/6, 2014 at 14:28 Comment(1)
Thank you, that solved it! The code works perfectly now! (Quest)
    def fetch_articles(self, response):
        sel = Selector(response)
        print response.body  # you can also write the body to a file and grep it
        for ref in sel.xpath("//a[contains(@href,'checkpart.aspx?Serial=')]/@href").extract():
            yield Request('http://digital.ahram.org.eg/' + ref, callback=self.parse_items)

I could not find the "checkpart.aspx?Serial=" string you are searching for anywhere in the response body.

This might not solve your issue, but I'm posting it as an answer rather than a comment for the sake of code formatting.
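The dump-and-grep idea can be made concrete with a small self-contained sketch (plain Python 3, outside Scrapy; the body bytes here are a made-up stand-in for what response.body might contain when validation fails):

```python
# Debugging sketch: save a (hypothetical) response body to disk, then
# check whether the link pattern the spider expects actually appears.
body = b'<html><body>Validation of viewstate MAC failed.</body></html>'

# Write the body out so it can also be inspected with grep or a browser.
with open('response_dump.html', 'wb') as f:
    f.write(body)

# Re-read the dump and search for the marker the XPath loop depends on.
with open('response_dump.html', 'rb') as f:
    found = b'checkpart.aspx?Serial=' in f.read()

print(found)  # False here: the marker is absent, so the loop body never runs
```

If the marker is missing, the spider's for loop silently yields nothing, which is exactly the symptom described in the question.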

Cabrales answered 30/5, 2014 at 12:19 Comment(1)
Thank you for the hint. I didn't know I could inspect the content of the response that way. You're right that checkpart.aspx?Serial= is not in the response, and I wonder why. My guess is that there is something wrong with the way the JavaScript form is filled, though I don't know what exactly. (Quest)
