I'm trying to crawl Google Scholar search results and get the BibTeX entry for each result matching the search. Right now I have a Scrapy crawler with Splash. I have a Lua script that clicks the "Cite" link and loads the modal window before getting the href of the BibTeX format of the citation. But since there are multiple search results, and hence multiple "Cite" links, I need to click them all and load the individual BibTeX pages.
Here's what I have:
import scrapy
from scrapy_splash import SplashRequest


class CiteSpider(scrapy.Spider):
    name = "cite"
    allowed_domains = ["scholar.google.com", "scholar.google.ae"]
    start_urls = [
        'https://scholar.google.ae/scholar?q="thermodynamics"&hl=en'
    ]

    script = """
        function main(splash)
          local url = splash.args.url
          assert(splash:go(url))
          assert(splash:wait(0.5))
          splash:runjs('document.querySelectorAll("a.gs_nph[aria-controls=gs_cit]")[0].click()')
          splash:wait(3)
          local href = splash:evaljs('document.querySelectorAll(".gs_citi")[0].href')
          assert(splash:go(href))
          return {
            html = splash:html(),
            png = splash:png(),
            href=href,
          }
        end
        """

    def parse(self, response):
        yield SplashRequest(self.start_urls[0], self.parse_bib,
                            endpoint="execute",
                            args={"lua_source": self.script})

    def parse_bib(self, response):
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.css("body > pre::text").extract()[0])
I'm thinking I should pass the index of the "Cite" link into the Lua script when I perform the querySelectorAll call, but I can't seem to find a way to pass another variable into the function. I also assume I'll have to do some dirty JavaScript history.back() to return to the original results page after getting the BibTeX, but I feel there's a more elegant way to handle this.
You can pass extra arguments via args, e.g. SplashRequest(..., args={"lua_source": script, "n": n}), and then access them from the script as splash.args.n. This way string formatting won't be needed. String formatting has several disadvantages: it is more code, you need to escape values so they are valid Lua (n/a for integer variables), and it doesn't play well with caching (Splash can cache scripts, so that there is no need to send a script with each request). – Harass
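To make the suggestion above concrete, here is a rough sketch of how the index could be passed through args and read as splash.args.n. The names CITE_SCRIPT, cite_request_args and cite_requests are made up for illustration, the number of "Cite" links (count) is assumed to be known in advance, and using splash:jsfunc to hand n to the page is just one way to do it, not necessarily what the commenter had in mind.

```python
# Lua script kept constant across requests; the link index arrives as splash.args.n.
CITE_SCRIPT = """
function main(splash)
  assert(splash:go(splash.args.url))
  assert(splash:wait(0.5))
  -- Wrap the click in a JS function so the index is passed as an argument
  -- instead of being formatted into the script text.
  local click_cite = splash:jsfunc([[
    function (n) {
      document.querySelectorAll("a.gs_nph[aria-controls=gs_cit]")[n].click();
    }
  ]])
  click_cite(splash.args.n)
  assert(splash:wait(3))
  local href = splash:evaljs('document.querySelectorAll(".gs_citi")[0].href')
  assert(splash:go(href))
  return {html = splash:html(), href = href}
end
"""


def cite_request_args(n, script=CITE_SCRIPT):
    """Build the args dict for one request; n is the 0-based "Cite" link index."""
    return {"lua_source": script, "n": n}


def cite_requests(url, count, callback):
    """Yield one SplashRequest per "Cite" link (count is an assumed, known value)."""
    # Imported here so the helpers above can be used without scrapy-splash installed.
    from scrapy_splash import SplashRequest

    for n in range(count):
        yield SplashRequest(url, callback, endpoint="execute",
                            args=cite_request_args(n))
```

In the spider, parse could then do something like: for req in cite_requests(response.url, 10, self.parse_bib): yield req, where 10 is only a guess at the number of results per page. Since every request loads the results page from scratch, there is no need for history.back(); the trade-off is one extra page load per citation.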