Troubles using scrapy with javascript __doPostBack method

I'm trying to automatically grab the search results from a public search, but I'm running into some trouble. The URL is of the form

http://www.website.com/search.aspx?keyword=#&&page=1&sort=Sorting

As I click through the pages, after visiting this first page, the URL changes slightly to

http://www.website.com/search.aspx?keyword=#&&sort=Sorting&page=2

The problem is that if I try to visit the second link directly, without first visiting the first link, I get redirected back to the first link. My current attempt at getting around this is defining a long list of start_urls in scrapy.

from scrapy.spider import BaseSpider

class websiteSpider(BaseSpider):
    name = "website"
    allowed_domains = ["website.com"]
    baseUrl = "http://www.website.com/search.aspx?keyword=#&&sort=Sorting&page="
    start_urls = [baseUrl + str(i) for i in range(1, 1000)]

Currently this code simply ends up visiting the first page over and over again. I feel like this is probably straightforward, but I don't quite know how to get around this.

UPDATE: Made some progress investigating this and found that the site updates each page by sending a POST request to the previous page using __doPostBack(arg1, arg2). My question now is: how exactly do I mimic this POST request using scrapy? I know how to make a POST request, but not exactly how to pass it the arguments I want.
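
From poking around, __doPostBack(eventTarget, eventArgument) seems to just copy its two arguments into the page's hidden __EVENTTARGET and __EVENTARGUMENT inputs and submit the main form, so the POST body to mimic should look roughly like this (plus whatever other hidden fields the form carries):

# rough sketch of what a __doPostBack('ctl00$empcnt$ucResults$pagination', '2')
# call ends up posting (control name taken from the pagination link on the page)
postback_fields = {
    '__EVENTTARGET': 'ctl00$empcnt$ucResults$pagination',  # first __doPostBack argument
    '__EVENTARGUMENT': '2',                                 # second argument (the page to go to)
}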

SECOND UPDATE: I've been making a lot of progress! I think... I looked through examples and documentation and eventually slapped together this version of what I think should do the trick:

def start_requests(self):
    baseUrl = "http://www.website.com/search.aspx?keyword=#&&sort=Sorting&page="
    target = 'ctl00$empcnt$ucResults$pagination'
    requests = []
    for i in range(1, 5):
        url = baseUrl + str(i)
        argument = str(i+1)
        data = {'__EVENTTARGET': target, '__EVENTARGUMENT': argument}
        currentPage = FormRequest(url, data)
        requests.append(currentPage)
    return requests

The idea is that this treats the POST request just like a form and updates accordingly. However, when I actually try to run this I get the following traceback(s) (Condensed for brevity):

2013-03-22 04:03:03-0400 [guru] ERROR: Unhandled error on engine.crawl()
dfd.addCallbacks(request.callback or spider.parse, request.errback)
      File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 280, in addCallbacks
        assert callable(callback)
    exceptions.AssertionError: 

2013-03-22 04:03:03-0400 [-] ERROR: Unhandled error in Deferred:
2013-03-22 04:03:03-0400 [-] Unhandled Error
    Traceback (most recent call last):
    Failure: scrapy.exceptions.IgnoreRequest: Skipped (request already seen)

Changing question to be more directed at what this post has turned into.

Thoughts?

P.S. When the second error happens, scrapy is unable to cleanly shut down and I have to send a SIGINT twice to get things to actually wrap up.

Anchoveta answered 22/3, 2013 at 0:39 Comment(0)

FormRequest doesn't accept formdata as a positional argument in its constructor:

class FormRequest(Request):
    def __init__(self, *args, **kwargs):
        formdata = kwargs.pop('formdata', None)

so you actually have to say formdata=:

requests.append(FormRequest(url, formdata=data))
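
With that in place, the loop from the question would look roughly like this (untested sketch; dont_filter=True just keeps scrapy's duplicate filter from skipping requests it thinks it has already seen, which is what the second traceback was about):

from scrapy.http import FormRequest

def start_requests(self):
    baseUrl = "http://www.website.com/search.aspx?keyword=#&&sort=Sorting&page="
    target = 'ctl00$empcnt$ucResults$pagination'
    requests = []
    for i in range(1, 5):
        url = baseUrl + str(i)
        data = {'__EVENTTARGET': target, '__EVENTARGUMENT': str(i + 1)}
        # formdata must be passed as a keyword argument, otherwise it gets taken
        # as the callback (hence the AssertionError in the first traceback)
        requests.append(FormRequest(url, formdata=data, dont_filter=True))
    return requests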
Purulence answered 22/3, 2013 at 18:37 Comment(5)
Awesome! Thanks so much, but there's still a deeper problem: it now returns 404 errors uniformly. Any ideas? – Anchoveta
404 means the server thinks the URL is wrong, so what is the actual URL? If the POST data is wrong, the server will usually give a 500-ish error. Btw: you just bumped me up over the 1k mark, nice. – Purulence
Also, if you are dealing with .aspx, you may need to include the giant __VIEWSTATE in the POST data as well (a sketch of one way to handle that follows these comments). – Purulence
The site is guru.com, and the really strange thing is that the responses I get to the POST are a mix of 404 errors and 200 codes whose response body is a 500 error page. – Anchoveta
Realized this is probably a different question at this point, so I will re-ask the relevant part and mark this accepted. – Anchoveta
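
Regarding the __VIEWSTATE point in the comments: one way to handle it is to GET the search page once and build the postbacks with FormRequest.from_response, which copies the form's hidden fields (__VIEWSTATE and friends) into the POST for you. A rough sketch, using the pagination target from the question and a hypothetical parse_first_page callback:

from scrapy.http import Request, FormRequest

def start_requests(self):
    # fetch the first results page once so its form (and __VIEWSTATE) is available
    url = "http://www.website.com/search.aspx?keyword=#&&page=1&sort=Sorting"
    return [Request(url, callback=self.parse_first_page)]

def parse_first_page(self, response):
    target = 'ctl00$empcnt$ucResults$pagination'
    for page in range(2, 5):
        # from_response copies the form's hidden inputs (__VIEWSTATE and friends)
        # and merges in the two fields that __doPostBack would have set
        yield FormRequest.from_response(
            response,
            formdata={'__EVENTTARGET': target, '__EVENTARGUMENT': str(page)},
            dont_click=True,   # don't simulate clicking a submit button
            callback=self.parse,
        )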
