Python - urllib2 & cookielib
I am trying to open the following website, retrieve the initial cookie, and reuse it for the second request. However, if you run the following code, it prints two different cookies. How do I use the initial cookie for the second request?

import cookielib, urllib2

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

home = opener.open('https://www.idcourts.us/repository/start.do')
print cj

search = opener.open('https://www.idcourts.us/repository/partySearch.do')
print cj

The output shows two different cookies every time:

<cookielib.CookieJar[<Cookie JSESSIONID=0DEEE8331DE7D0DFDC22E860E065085F for www.idcourts.us/repository>]>
<cookielib.CookieJar[<Cookie JSESSIONID=E01C2BE8323632A32DA467F8A9B22A51 for www.idcourts.us/repository>]>
Stormi answered 3/1, 2011 at 8:15 Comment(0)
S
21

This is not a problem with urllib2; the site does some funky stuff. You need to request a couple of stylesheets before it will validate your session id:

import cookielib, urllib2

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
# default User-Agent ('Python-urllib/2.6') will *not* work
opener.addheaders = [
    ('User-Agent', 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.11) Gecko/20101012 Firefox/3.6.11'),
    ]


stylesheets = [
    'https://www.idcourts.us/repository/css/id_style.css',
    'https://www.idcourts.us/repository/css/id_print.css',
]

home = opener.open('https://www.idcourts.us/repository/start.do')
print cj
sessid = cj._cookies['www.idcourts.us']['/repository']['JSESSIONID'].value
# Note the +=
opener.addheaders += [
    ('Referer', 'https://www.idcourts.us/repository/start.do'),
    ]
for st in stylesheets:
    # the trick: request each stylesheet with the session id appended to the URL
    opener.open(st+';jsessionid='+sessid)
search = opener.open('https://www.idcourts.us/repository/partySearch.do')
print cj
# perhaps need to keep updating the referer...
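
The lookup above reaches into `cj._cookies`, which is a private attribute. Here is a minimal sketch of the same lookup through the jar's public iteration interface, written for Python 3 (where `cookielib` became `http.cookiejar`). The helper names `find_cookie_value`, `with_jsessionid` and `make_cookie` are made up for illustration, and the cookie is built by hand with a placeholder value so the example runs without touching the site:

```python
from http.cookiejar import Cookie, CookieJar


def find_cookie_value(jar, name, domain=None, path=None):
    """Return the value of the first matching cookie, or None."""
    for cookie in jar:  # a CookieJar can be iterated directly
        if cookie.name != name:
            continue
        if domain is not None and cookie.domain != domain:
            continue
        if path is not None and cookie.path != path:
            continue
        return cookie.value
    return None


def with_jsessionid(url, session_id):
    """Append the session id as a ';jsessionid=' path parameter."""
    return url + ';jsessionid=' + session_id


def make_cookie(name, value, domain, path):
    """Build a session cookie by hand (stand-in for a real server response)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path=path, path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


jar = CookieJar()
# Placeholder value; a real run would get this from the start.do response.
jar.set_cookie(make_cookie('JSESSIONID', 'EXAMPLE0123',
                           'www.idcourts.us', '/repository'))

session_id = find_cookie_value(jar, 'JSESSIONID', domain='www.idcourts.us')
stylesheet_url = with_jsessionid(
    'https://www.idcourts.us/repository/css/id_style.css', session_id)
print(stylesheet_url)
```

Because `find_cookie_value` returns `None` when nothing matches, a missing session cookie can be detected before the URL is built, whereas the nested `_cookies` lookup raises `KeyError`.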
Smoko answered 4/1, 2011 at 0:48 Comment(3)
It's working now :) I had left the opener.addheaders dangling in my IPython session. That code should work as-is (works for me on Python 2.6 on a Mac at least). - Smoko
The code I posted is not robust. Sometimes the session will stick, other times it won't. My guess is that there's something implemented server-side to discourage non-human access (i.e. rather strict session invalidation policies). - Smoko
How did you come to conclude this: "You need to request a couple of stylesheets for it to validate your session id"? I would like to learn how. - Vickers
P
7

Not an actual answer (but far too long for a comment); possibly useful to anyone else trying to answer this.

Despite my best attempts, I can't figure this out.

Looking in Firebug, the cookie seems to remain the same (works properly) for Firefox.

I added urllib2.HTTPSHandler(debuglevel=1) to debug what headers Python is sending, and it does appear to resend the cookie.
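
One way to double-check that without any network traffic is to ask the jar which `Cookie` header it would attach to the second request. This is a rough sketch for Python 3 (`http.cookiejar` and `urllib.request`, the successors of `cookielib` and `urllib2`). The helper `cookie_header_for` and the placeholder cookie value are assumptions for illustration only:

```python
from http.cookiejar import Cookie, CookieJar
from urllib.request import Request


def cookie_header_for(jar, url):
    """Return the Cookie header the jar would add to a request for url."""
    request = Request(url)
    jar.add_cookie_header(request)  # applies the jar's domain/path policy
    return request.get_header('Cookie')  # None if no cookie matches


jar = CookieJar()
# Hand-built stand-in for the cookie set by start.do (placeholder value).
jar.set_cookie(Cookie(
    version=0, name='JSESSIONID', value='EXAMPLE0123',
    port=None, port_specified=False,
    domain='www.idcourts.us', domain_specified=False, domain_initial_dot=False,
    path='/repository', path_specified=True,
    secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={},
))

search_header = cookie_header_for(
    jar, 'https://www.idcourts.us/repository/partySearch.do')
other_header = cookie_header_for(jar, 'https://example.com/')
print(search_header)
print(other_header)
```

If the header for partySearch.do carries the session id, the client side is doing its part, which would suggest the new cookie comes from the server discarding the session rather than from the jar failing to send it.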

I also added all the Firefox request headers to see if that would help (it didn't):

opener.addheaders = [
    ('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13'),
    ..
]

My test code:

import cookielib, urllib2

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj), urllib2.HTTPSHandler(debuglevel=1))
opener.addheaders = [
    ('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13'),
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
    ('Accept-Language', 'en-gb,en;q=0.5'),
    ('Accept-Encoding', 'gzip,deflate'),
    ('Accept-Charset', 'ISO-8859-1,utf-8;q=0.7,*;q=0.7'),
    ('Keep-Alive', '115'),
    ('Connection', 'keep-alive'),
    ('Cache-Control', 'max-age=0'),
    ('Referer', 'https://www.idcourts.us/repository/partySearch.do'),
]

home = opener.open('https://www.idcourts.us/repository/start.do')
print cj

search = opener.open('https://www.idcourts.us/repository/partySearch.do')
print cj

I feel like I'm missing something obvious.

Priest answered 3/1, 2011 at 9:37 Comment(1)
There could be some nasty JavaScript on the page. - Cale
C
0

I think it is a problem with the server: it is setting a new cookie for each request.

Cycloparaffin answered 3/1, 2011 at 9:4 Comment(1)
It doesn't do it when you browse from an actual browser though... that's the weird thing. - Stormi
