Why does Python say this Netscape cookie file isn't valid?

I'm writing a Google Scholar parser, and based on this answer, I'm setting cookies before grabbing the HTML. This is the contents of my cookies.txt file:

# Netscape HTTP Cookie File
# http://curlm.haxx.se/rfc/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.

.scholar.google.com     TRUE    /       FALSE   2147483647      GSP     ID=353e8f974d766dcd:CF=2
.google.com     TRUE    /       FALSE   1317124758      PREF    ID=353e8f974d766dcd:TM=1254052758:LM=1254052758:S=_biVh02e4scrJT1H
.scholar.google.co.uk   TRUE    /       FALSE   2147483647      GSP     ID=f3f18b3b5a7c2647:CF=2
.google.co.uk   TRUE    /       FALSE   1317125123      PREF    ID=f3f18b3b5a7c2647:TM=1254053123:LM=1254053123:S=UqjRcTObh7_sARkN

and this is the code I'm using to grab the HTML:

import http.cookiejar
import urllib.request, urllib.parse, urllib.error

def get_page(url, headers="", params=""):
    filename = "cookies.txt"
    request = urllib.request.Request(url, None, headers, params)
    cookies = http.cookiejar.MozillaCookieJar(filename, None, None)
    cookies.load()
    cookie_handler = urllib.request.HTTPCookieProcessor(cookies)
    redirect_handler = urllib.request.HTTPRedirectHandler()
    opener = urllib.request.build_opener(redirect_handler,cookie_handler)
    response = opener.open(request)
    return response

start = 0
search = "Ricardo Altamirano"
results_per_fetch = 20
host = "http://scholar.google.com"
base_url = "/scholar"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; U; ru; rv:5.0.1.6) Gecko/20110501 Firefox/5.0.1 Firefox/5.0.1'}
params = urllib.parse.urlencode({'start' : start,
                                 'q': '"' + search + '"',
                                 'btnG' : "",
                                 'hl' : 'en',
                                 'num': results_per_fetch,
                                 'as_sdt' : '1,14'})

url = base_url + "?" + params
resp = get_page(host + url, headers, params)

The full traceback is:

Traceback (most recent call last):
  File "C:/Users/ricardo/Desktop/Google-Scholar/BibTex/test.py", line 29, in <module>
    resp = get_page(host + url, headers, params)
  File "C:/Users/ricardo/Desktop/Google-Scholar/BibTex/test.py", line 8, in get_page
    cookies.load()
  File "C:\Python32\lib\http\cookiejar.py", line 1767, in load
    self._really_load(f, filename, ignore_discard, ignore_expires)
  File "C:\Python32\lib\http\cookiejar.py", line 1997, in _really_load
    filename)
http.cookiejar.LoadError: 'cookies.txt' does not look like a Netscape format cookies file

I've looked around for documentation on the Netscape cookie file format, but I can't find anything that shows me the problem. Are there newlines that need to be included? I changed the line endings to Unix style, just in case, but that didn't solve the problem. The closest specification I can find is this, which doesn't indicate anything to me that I'm missing. The fields on each of the last four lines are separated by tabs, not spaces, and everything else looks correct to me.

Cot answered 17/7, 2012 at 19:24 Comment(3)
The Netscape cookie specification that used to be hosted at netscape.com before someone (AOL?) ruined history. - Deadwood
An updated spec as RFC 2965 with Set-Cookie2. - Deadwood
For anyone interested: you can call 'cookies.save(cookie_file, ignore_discard=True, ignore_expires=True)' to create a valid cookie file to compare against the invalid cookies.txt. Compare them line by line or byte by byte, removing lines one at a time, and you will eventually find the cause. - Karmakarmadharaya
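
The comparison suggested in the last comment could be set up roughly as follows. This is only a sketch: the make_cookie helper, the .example.com domain, the cookie value, and the temporary file location are all assumptions made for illustration, not part of the original thread.

```python
import http.cookiejar
import os
import tempfile

def make_cookie(domain, name, value, expires):
    """Build a Cookie by hand (hypothetical helper); the constructor needs every field."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=domain.startswith("."),
        domain_initial_dot=domain.startswith("."),
        path="/", path_specified=True,
        secure=False, expires=expires, discard=False,
        comment=None, comment_url=None, rest={},
    )

# Write a known-good reference file into a temporary directory.
reference_path = os.path.join(tempfile.mkdtemp(), "reference_cookies.txt")
jar = http.cookiejar.MozillaCookieJar(reference_path)
jar.set_cookie(make_cookie(".example.com", "GSP", "ID=abc:CF=2", 2147483647))
jar.save(ignore_discard=True, ignore_expires=True)

# Read it back so it can be compared with the file that fails to load.
with open(reference_path) as f:
    reference_lines = f.read().splitlines()

for line in reference_lines:
    print(repr(line))
```

Printing each line with repr() makes tabs and stray characters visible, which is what a line-by-line comparison against the failing cookies.txt needs.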
J
13

I see nothing in your example code or copy of the cookies.txt file that is obviously wrong.

I've checked the source code for the MozillaCookieJar._really_load method, which throws the exception that you see.

The first thing this method does is read the first line of the file you specified (using f.readline()) and use re.search to look for the regular expression pattern "#( Netscape)? HTTP Cookie File". This is the check that fails for your file.

It certainly looks like your cookies.txt would match that format, so the error you see is quite surprising.

Note that your file is opened with a simple open(filename) call earlier on, so it'll be opened in text mode with universal line ending support, meaning it doesn't matter that you are running this on Windows. The code will see \n newline terminated strings, regardless of what newline convention was used in the file itself.
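
The newline translation described above can be checked in isolation. The following is a minimal sketch, assuming a throwaway temporary file; the first_line_as_text helper is a name made up for this illustration.

```python
import os
import tempfile

def first_line_as_text(raw_bytes):
    """Write raw_bytes to a temporary file and return its first line read in text mode."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(raw_bytes)
        # Default text mode (newline=None) translates \r\n and \r to \n on read.
        with open(path, encoding="ascii") as f:
            return f.readline()
    finally:
        os.remove(path)

header = b"# Netscape HTTP Cookie File"
unix_line = first_line_as_text(header + b"\n")
windows_line = first_line_as_text(header + b"\r\n")
print(repr(unix_line), repr(windows_line))
```

Both calls should yield the same string, which is why converting the line endings is unlikely to change what the cookie jar sees.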

What I'd do in this case is triple-check that your file's first line is really correct. It needs to contain either "# HTTP Cookie File" or "# Netscape HTTP Cookie File" (spaces only, no tabs, between the words, with matching capitalisation). Test this at the Python prompt:

>>> f = open('cookies.txt')
>>> line = f.readline()
>>> line
'# Netscape HTTP Cookie File\n'
>>> import re
>>> re.search("#( Netscape)? HTTP Cookie File", line)
<_sre.SRE_Match object at 0x10fecfdc8>

Python echoed the line representation back to me when I typed line at the prompt, including the \n newline character. Any surprises like tab characters or Unicode zero-width spaces will show up there as escape codes. I also verified that the regular expression used by the cookiejar code matches.
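
To rule out hidden bytes more systematically, the raw first line can be inspected before any decoding takes place. The sketch below is illustrative only: describe_first_line is a hypothetical helper, and the cp1252 default is an assumption standing in for a typical Windows locale encoding. Because the check uses re.search rather than a match anchored at the start, a UTF-8 BOM in front of the # should not defeat it by itself, whereas a file saved as UTF-16 would.

```python
import codecs
import re

# Pattern quoted in the answer above; http.cookiejar applies it with search().
MAGIC_RE = re.compile(r"#( Netscape)? HTTP Cookie File")

def describe_first_line(raw, encoding="cp1252"):
    """Report what a text-mode reader using `encoding` would see in raw first-line bytes."""
    text = raw.decode(encoding, errors="replace")
    return {
        "utf8_bom": raw.startswith(codecs.BOM_UTF8),
        "utf16_bom": raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)),
        "magic_found": MAGIC_RE.search(text) is not None,
        "repr": repr(text),
    }

header = "# Netscape HTTP Cookie File\n"
plain = describe_first_line(header.encode("ascii"))
utf8_bom = describe_first_line(codecs.BOM_UTF8 + header.encode("ascii"))
utf16 = describe_first_line(header.encode("utf-16"))

for label, report in [("plain", plain), ("utf-8 BOM", utf8_bom), ("utf-16", utf16)]:
    print(label, report)
```

For a real file, the bytes would come from reading the first line of cookies.txt opened in "rb" mode and passing it to the helper.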

You can also use pdb, the Python debugger, to verify what the http.cookiejar module really does:

>>> import pdb
>>> import http.cookiejar
>>> jar = http.cookiejar.MozillaCookieJar('cookies.txt')
>>> pdb.run('jar.load()')
> <string>(1)<module>()
(Pdb) s
--Call--
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1759)load()
-> def load(self, filename=None, ignore_discard=False, ignore_expires=False):
(Pdb) s
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1761)load()
-> if filename is None:
(Pdb) s
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1762)load()
-> if self.filename is not None: filename = self.filename
(Pdb) s
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1765)load()
-> f = open(filename)
(Pdb) n
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1766)load()
-> try:
(Pdb) 
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1767)load()
-> self._really_load(f, filename, ignore_discard, ignore_expires)
(Pdb) s
--Call--
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1989)_really_load()
-> def _really_load(self, f, filename, ignore_discard, ignore_expires):
(Pdb) s
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1990)_really_load()
-> now = time.time()
(Pdb) n
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1992)_really_load()
-> magic = f.readline()
(Pdb) 
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1993)_really_load()
-> if not self.magic_re.search(magic):
(Pdb) 
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1999)_really_load()
-> try:

In the above sample pdb session I used a combination of the step and next commands to verify that the regular expression test (self.magic_re.search(magic)) actually passed.

Jardine answered 18/7, 2012 at 7:47 Comment(4)
Excellent answer! (and a fine example of debugging Python as well). I don't know if something else has changed on my system, but the code works at the moment without any alterations to cookies.txt. The first line of the file was identical to yours, including spaces, tabs, etc., so I'm not sure what was causing the problem. - Cot
@RicardoAltamirano just a guess: a text encoding change, e.g. a leading BOM such as the unofficial UTF-8 one (\xef\xbb\xbf), could have this effect and might not be obvious, since only the binary contents change while the text appears the same. The same could happen through a code change, for example if you previously used open and later switched to codecs.open. - Deadwood
@naxa: the open() and f.readline() session I show in my answer would (on Python 2) easily show any such codepoints. IIRC a UTF-8 BOM would still be part of the Unicode value returned from a codecs.open() or io.open() file object, and the telltale u' Unicode string literals would be a dead giveaway in any case. - Jardine
Out of guesses then! Except, although this is cargo cult, it's worth checking whether free disk space is 0; that is usually unexpected and may lead to strange behaviour. - Deadwood
S
8

Paste this in your dev console:

copy('# Netscape HTTP Cookie File\n' + document.cookie.split(/; /g).map(e => e.replace('=', '\t')).map(e => window.location.hostname.replace('www.', '.') + '\tTRUE\t/\tFALSE\t-1\t' + e).join('\n'))

Netscape-formatted cookies will be in your system's clipboard :)
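
For readers who would rather do the same conversion in Python, a rough counterpart might look like the sketch below. The cookie_header_to_netscape function, the sample cookie string, and the www.example.com hostname are assumptions made for illustration. Two details differ from the one-liner on purpose: MozillaCookieJar expects the second field to be TRUE exactly when the domain starts with a dot, and its load() skips expired cookies by default, so the sketch uses the far-future timestamp seen in the question's file instead of -1.

```python
import http.cookiejar
import os
import tempfile

def cookie_header_to_netscape(cookie_string, hostname, expires=2147483647):
    """Convert 'a=1; b=2' into Netscape cookie-file text for hostname (hypothetical helper)."""
    # "www.example.com" becomes ".example.com"; other hostnames are kept as-is.
    domain = hostname[3:] if hostname.startswith("www.") else hostname
    # The flag must agree with the leading dot, or MozillaCookieJar rejects the line.
    flag = "TRUE" if domain.startswith(".") else "FALSE"
    lines = ["# Netscape HTTP Cookie File"]
    for pair in cookie_string.split("; "):
        if not pair:
            continue
        name, _, value = pair.partition("=")
        lines.append("\t".join([domain, flag, "/", "FALSE", str(expires), name, value]))
    return "\n".join(lines) + "\n"

# Round-trip through MozillaCookieJar to confirm the text is accepted.
text = cookie_header_to_netscape("sid=abc123; theme=dark", "www.example.com")
path = os.path.join(tempfile.mkdtemp(), "cookies.txt")
with open(path, "w") as f:
    f.write(text)

jar = http.cookiejar.MozillaCookieJar(path)
jar.load()
loaded_names = sorted(cookie.name for cookie in jar)
print(loaded_names)
```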

Steels answered 19/4, 2020 at 19:57 Comment(1)
To make the copied cookie work with youtube-dl, I changed the last map to: .map(e => window.location.hostname.replace('www.', '') + '\tFALSE\t/\tTRUE\t0\t' + e). Your answer might be unrelated to the question but it was exactly what I wanted! Since I wasn't going to install a web extension just for this task, I was getting prepared to write my own tool for this. But you saved me from that, thank you :) - Theorbo
G
3

In my scenario, two modifications were needed to the MozillaCookieJar implementation (under /System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/):

  1. The magic header

    You can either remove the check logic or add the magic header to the file. I prefer adding the header:

    # Netscape HTTP Cookie File

  2. The new file format seems to allow you to omit the expires field

    vals = line.split("\t")
    if len(vals) == 7:
        # Full line: the expires field is present.
        domain, domain_specified, path, secure, expires, name, value = vals
    elif len(vals) == 6:
        # Shorter line: treat the expires field as omitted.
        domain, domain_specified, path, secure, name, value = vals
        expires = None
    

Lastly, I really hope the implementation gets updated to handle these changes.
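
As an alternative to editing the installed standard library, the same tolerance could be sketched in a small standalone parser. The parse_netscape_line function and its dictionary layout are assumptions made for this illustration and are not part of http.cookiejar; the six-field case follows this answer's observation that the expires field may be missing.

```python
def parse_netscape_line(line):
    """Parse one cookie-file line; return a dict, or None for comments and blank lines."""
    line = line.rstrip("\r\n")
    stripped = line.strip()
    if not stripped or stripped.startswith(("#", "$")):
        return None
    vals = line.split("\t")
    if len(vals) == 7:
        domain, domain_specified, path, secure, expires, name, value = vals
    elif len(vals) == 6:
        # Assumption taken from the answer above: the expires field was left out.
        domain, domain_specified, path, secure, name, value = vals
        expires = ""
    else:
        raise ValueError("expected 6 or 7 tab-separated fields, got %d" % len(vals))
    return {
        "domain": domain,
        "domain_specified": domain_specified == "TRUE",
        "path": path,
        "secure": secure == "TRUE",
        "expires": int(expires) if expires else None,
        "name": name,
        "value": value,
    }

full = parse_netscape_line(".scholar.google.com\tTRUE\t/\tFALSE\t2147483647\tGSP\tID=353e8f974d766dcd:CF=2\n")
short = parse_netscape_line(".scholar.google.com\tTRUE\t/\tFALSE\tGSP\tID=353e8f974d766dcd:CF=2\n")
print(full["expires"], short["expires"])
```

A line whose fields are separated by spaces instead of tabs splits into a single field and raises ValueError, which also makes this a quick way to spot the tab-versus-space problem mentioned in the question.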

Gerek answered 19/4, 2015 at 6:32 Comment(0)
