Splash Lua script to do multiple clicks and visits

I'm trying to crawl Google Scholar search results and get the BibTeX entry for each result matching the search. Right now I have a Scrapy crawler with Splash. I have a Lua script which clicks the "Cite" link and loads the modal window before getting the href of the BibTeX version of the citation. But since there are multiple search results, and hence multiple "Cite" links, I need to click them all and load each BibTeX page individually.

Here's what I have:

import scrapy
from scrapy_splash import SplashRequest


class CiteSpider(scrapy.Spider):
    name = "cite"
    allowed_domains = ["scholar.google.com", "scholar.google.ae"]
    start_urls = [
        'https://scholar.google.ae/scholar?q="thermodynamics"&hl=en'
    ]

    script = """
        function main(splash)
          local url = splash.args.url
          assert(splash:go(url))
          assert(splash:wait(0.5))
          splash:runjs('document.querySelectorAll("a.gs_nph[aria-controls=gs_cit]")[0].click()')
          splash:wait(3)
          local href = splash:evaljs('document.querySelectorAll(".gs_citi")[0].href')
          assert(splash:go(href))
          return {
            html = splash:html(),
            png = splash:png(),
            href=href,
          }
        end
        """

    def parse(self, response):
        yield SplashRequest(self.start_urls[0], self.parse_bib,
                            endpoint="execute",
                            args={"lua_source": self.script})

    def parse_bib(self, response):
        filename = response.url.split("/")[-2] + '.html'
        # open in text mode ('w', not 'wb'): extract() returns str, not bytes
        with open(filename, 'w') as f:
            f.write(response.css("body > pre::text").extract()[0])

I'm thinking I should pass the index of the "Cite" link into the Lua script when I perform the querySelectorAll call, but I can't seem to find a way to pass another variable into the function. Also, I assume I'll have to do some dirty JavaScript history.back() to return to the original results page after getting the BibTeX, but I feel there's a more elegant way to handle this.

Julee answered 26/6, 2016 at 22:11 Comment(0)

Okay, so I hacked up a solution which works. First of all, we'll need the Lua script to be parameterised, so we'll generate it from a function:

def script(n):
    # The second format() argument supplies the Lua return table as a plain
    # string so its braces aren't mistaken for str.format placeholders.
    _script = """
        function main(splash)
          local url = splash.args.url
          local href = ""
          assert(splash:go(url))
          assert(splash:wait(0.5))
          splash:runjs('document.querySelectorAll("a.gs_nph[aria-controls=gs_cit]")[{}].click()')
          splash:wait(3)
          href = splash:evaljs('document.querySelectorAll("a.gs_citi")[0].href')
          assert(splash:go(href))
          return {}
        end
        """.format(n, "{html=splash:html(), png=splash:png(), href=href}")
    return _script
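
For clarity, here is what script(0) expands to once both placeholders have been substituted:

function main(splash)
  local url = splash.args.url
  local href = ""
  assert(splash:go(url))
  assert(splash:wait(0.5))
  splash:runjs('document.querySelectorAll("a.gs_nph[aria-controls=gs_cit]")[0].click()')
  splash:wait(3)
  href = splash:evaljs('document.querySelectorAll("a.gs_citi")[0].href')
  assert(splash:go(href))
  return {html=splash:html(), png=splash:png(), href=href}
end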

I then had to modify the parse function so that it clicks all the "Cite" links on the page, by iterating through the matching links and clicking each one individually. I made the Lua script load the page again (which is dirty, but I can't think of any other way) and click the "Cite" link at the queried index. It also has to make duplicate requests, hence the dont_filter=True:

def parse(self, response):
    n = len(response.css("a.gs_nph[aria-controls=gs_cit]").extract())
    for i in range(n):
        yield SplashRequest(response.url, self.parse_bib,
                            endpoint="execute",
                            args={"lua_source": script(i)},
                            dont_filter=True)

Hope this helps.

Julee answered 30/6, 2016 at 10:32 Comment(2)
FYI: you can pass an 'n' argument in SplashRequest(..., args={"lua_source": script, "n": n}) and then access it from the script as splash.args.n. This way string formatting isn't needed. String formatting has several disadvantages: it is more code, you need to escape values to be valid Lua (n/a for integer variables), and it doesn't play well with caching (Splash can cache scripts, so that there is no need to send a script with each request).Harass
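
To illustrate that suggestion, a minimal, untested sketch (selectors and callback names are taken from the answer above); the script becomes a fixed string and the index travels through args:

script = """
    function main(splash)
      assert(splash:go(splash.args.url))
      assert(splash:wait(0.5))
      -- splash.args.n is the index passed in from SplashRequest below
      splash:runjs('document.querySelectorAll("a.gs_nph[aria-controls=gs_cit]")[' .. splash.args.n .. '].click()')
      splash:wait(3)
      local href = splash:evaljs('document.querySelectorAll("a.gs_citi")[0].href')
      assert(splash:go(href))
      return {html=splash:html(), png=splash:png(), href=href}
    end
"""

def parse(self, response):
    n = len(response.css("a.gs_nph[aria-controls=gs_cit]").extract())
    for i in range(n):
        yield SplashRequest(response.url, self.parse_bib,
                            endpoint="execute",
                            args={"lua_source": script, "n": i},
                            dont_filter=True)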
@Syafiq This is awesome. This looping/dynamic script solution worked for me crawling an AngularJS site through links from a main index page. I had to add a Python sleep and Lua code to authenticate my robot, but it worked really well and fixed my timeout issues (crawling 30+ links in one go).Ragouzis