Does urllib2.urlopen() cache stuff?

5

14

This isn't mentioned in the Python documentation. I've recently been testing a website by simply refreshing it with urllib2.urlopen() to extract certain content, and I've noticed that urllib2.urlopen() sometimes doesn't pick up newly added content after I update the site. So I wonder: does it cache stuff somewhere?

Systematize answered 27/8, 2010 at 16:34 Comment(4)
Web servers cache stuff, too. That's the usual culprit. Check the headers on the result, and update your question to include the output of info().Weatherspoon
@S.Lott: "Web servers cache stuff, too" Does that mean that if I don't get updated results from urllib2.urlopen(), it's mainly because the web server "knows" it's me refreshing and doesn't send the updated content? Is there a way to force the server to transmit the data anew every time I refresh the site?Systematize
Unless you know a lot about the web server, you don't really know what caches it has. It could have multiple levels of caching. It could have incorrectly configured cache. It could have pages that don't provide information to refresh cache. Much can go wrong on the server side.Weatherspoon
@S.Lott: Thanks a lot. So urllib2.urlopen() itself does not cache anything on my side, right?Systematize
10

So I wonder: does it cache stuff somewhere?

It doesn't.

If you don't see new data, there could be many reasons. Most larger web services use server-side caching for performance, for example caching proxies like Varnish and Squid, or application-level caching.

If the problem is caused by server-side caching, there's usually no way to force the server to give you the latest data.


For caching proxies like Squid, things are different. Usually, Squid adds some additional headers to the HTTP response (see response.info().headers).

If you see a header field called X-Cache or X-Cache-Lookup, this means that you aren't connected to the remote server directly, but through a transparent proxy.

If you see something like X-Cache: HIT from proxy.domain.tld, the response you got is cached. The opposite is X-Cache: MISS from proxy.domain.tld, which means the response is fresh.
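To make that check concrete, here is a small sketch; the is_cached helper is my own illustration, not part of urllib2:

```python
# Hypothetical helper: interpret an X-Cache header value such as
# "HIT from proxy.domain.tld" reported by a caching proxy like Squid.
def is_cached(x_cache_value):
    # A value beginning with "HIT" means the proxy served a stored copy;
    # "MISS" means the response came from the origin server.
    return x_cache_value.strip().upper().startswith('HIT')

print(is_cached('HIT from proxy.domain.tld'))   # True: served from the cache
print(is_cached('MISS from proxy.domain.tld'))  # False: response is fresh
```

You would feed it the X-Cache value pulled out of response.info() after an urlopen() call.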

Guidebook answered 27/8, 2010 at 17:41 Comment(1)
Thanks, now I know what the problem is.Systematize
5

Very old question, but I had a similar problem which this solution did not resolve.
In my case I had to spoof the User-Agent like this:

request = urllib2.Request(url)
# Some servers vary or cache responses by the client's User-Agent;
# spoofing a browser agent made the server return fresh content here.
request.add_header('User-Agent', 'Mozilla/5.0')
content = urllib2.build_opener().open(request)

Hope this helps anyone...

Incubator answered 4/4, 2012 at 9:18 Comment(1)
Thanks! Had the same issue when downloading JSON from a Drupal feed. This may not have anything to do with your actual Python script, but rather the server you're downloading data from. In our case that server cached content based on the user agent.Appointive
1

Your web server or an HTTP proxy may be caching content. You can try to disable caching by adding a Pragma: no-cache request header:

request = urllib2.Request(url)
# Pragma: no-cache asks intermediary caches not to serve a stored copy.
request.add_header('Pragma', 'no-cache')
content = urllib2.build_opener().open(request)
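Pragma: no-cache is the HTTP/1.0 mechanism; HTTP/1.1 caches honor Cache-Control instead, so sending both covers more intermediaries. A minimal sketch (the no_cache_headers helper is my own, not from the answer):

```python
# Hypothetical helper: headers that ask both HTTP/1.0 and HTTP/1.1
# caches to bypass any stored copy of the response.
def no_cache_headers():
    # Pragma is honored by HTTP/1.0 caches, Cache-Control by HTTP/1.1.
    return {'Pragma': 'no-cache', 'Cache-Control': 'no-cache'}
```

Apply them with request.add_header(name, value) for each pair before opening the request.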
Fleisig answered 22/8, 2013 at 14:33 Comment(0)
0

If you make changes and test the behaviour both from a browser and from urllib, it is easy to make a simple mistake. In the browser you are logged in, but urllib2.urlopen() may always be redirected to the same login page, so if you only check the page size or the top of your common layout, you could conclude that your changes had no effect.

Circassian answered 7/7, 2016 at 7:39 Comment(0)
-2

I find it hard to believe that urllib2 does not do caching, because in my case, upon restart of the program the data is refreshed. If the program is not restarted, the data appears to be cached forever. Also retrieving the same data from Firefox never returns stale data.

Agnes answered 14/10, 2010 at 19:42 Comment(1)
urllib2 doesn't do caching. Maybe you are using a proxy, or the web application itself is storing temporary data.Wild