I'm writing a Google Scholar parser, and based on this answer, I'm setting cookies before grabbing the HTML. This is the contents of my cookies.txt
file:
# Netscape HTTP Cookie File
# http://curlm.haxx.se/rfc/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.
.scholar.google.com TRUE / FALSE 2147483647 GSP ID=353e8f974d766dcd:CF=2
.google.com TRUE / FALSE 1317124758 PREF ID=353e8f974d766dcd:TM=1254052758:LM=1254052758:S=_biVh02e4scrJT1H
.scholar.google.co.uk TRUE / FALSE 2147483647 GSP ID=f3f18b3b5a7c2647:CF=2
.google.co.uk TRUE / FALSE 1317125123 PREF ID=f3f18b3b5a7c2647:TM=1254053123:LM=1254053123:S=UqjRcTObh7_sARkN
and this is the code I'm using to grab the HTML:
import http.cookiejar
import urllib.request, urllib.parse, urllib.error
def get_page(url, headers="", params=""):
filename = "cookies.txt"
request = urllib.request.Request(url, None, headers, params)
cookies = http.cookiejar.MozillaCookieJar(filename, None, None)
cookies.load()
cookie_handler = urllib.request.HTTPCookieProcessor(cookies)
redirect_handler = urllib.request.HTTPRedirectHandler()
opener = urllib.request.build_opener(redirect_handler,cookie_handler)
response = opener.open(request)
return response
start = 0
search = "Ricardo Altamirano"
results_per_fetch = 20
host = "http://scholar.google.com"
base_url = "/scholar"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; U; ru; rv:5.0.1.6) Gecko/20110501 Firefox/5.0.1 Firefox/5.0.1'}
params = urllib.parse.urlencode({'start' : start,
'q': '"' + search + '"',
'btnG' : "",
'hl' : 'en',
'num': results_per_fetch,
'as_sdt' : '1,14'})
url = base_url + "?" + params
resp = get_page(host + url, headers, params)
The full traceback is:
Traceback (most recent call last):
File "C:/Users/ricardo/Desktop/Google-Scholar/BibTex/test.py", line 29, in <module>
resp = get_page(host + url, headers, params)
File "C:/Users/ricardo/Desktop/Google-Scholar/BibTex/test.py", line 8, in get_page
cookies.load()
File "C:\Python32\lib\http\cookiejar.py", line 1767, in load
self._really_load(f, filename, ignore_discard, ignore_expires)
File "C:\Python32\lib\http\cookiejar.py", line 1997, in _really_load
filename)
http.cookiejar.LoadError: 'cookies.txt' does not look like a Netscape format cookies file
I've looked around for documentation on the Netscape cookie file format, but I can't find anything that shows me the problem. Are there newlines that need to be included? I changed the line endings to Unix style, just in case, but that didn't solve the problem. The closest specification I can find is this, which doesn't indicate anything to me that I'm missing. The fields on each of the last four lines are separated by tabs, not spaces, and everything else looks correct to me.