Python etag/last modified not working; how to get latest rss

I'm trying to write a Python program that grabs and displays any RSS updates since the last time the program was run. I am using feedparser and trying to use ETags and Last-Modified as described here on SO, but my test script does not seem to work.

import feedparser
rsslist=["http://skottieyoung.tumblr.com/rss","http://mrjakeparker.com/feed/"]
for feed in rsslist:
    print('--------'+feed+'-------')
    d=feedparser.parse(feed)
    print(len(d.entries))
    if (len(d.entries) > 0):
        etag=d.feed.get('etag','')
        modified=d.get('modified',d.get('updated',d.entries[0].get('published','no modified,update or published fields present in rss')))

        d2=feedparser.parse(feed,modified)
        if (len(d2.entries) > 0):
            etag2=d2.feed.get('etag','')
            modified2=d2.get('updated',d.entries[0].get('published',''))

        if (d2==d): #ideally we would never see this because etags/last modified would prevent unnecessarily downloading what we already have.
            print("Arrg these are the same")

I'm honestly not sure if rss/xml technology has changed from the references I've been using online or if there is a problem with my code.

Regardless, I'm looking for the best way to use RSS feeds efficiently. Specifically, I want to minimize wasted bandwidth, which is what the Last-Modified and ETag fields are intended for.

Thanks in advance.

Paredes answered 8/11, 2012 at 20:16 Comment(3)
The documentation says to use feed.etag. I don't know if it really matters though. - Crutcher
@NathanVillaescusa no it shouldn't matter. I'm using d.feed.get('etag','') as a way to handle errors. As it is I do it this way because none of the examples I use seem to return an etag. - Paredes
Ah, I thought it might be something like that. The first URL in your list does not have an etag in the response headers, the second one does. - Crutcher

Your issue is that you are passing the last-modified date in place of the etag. In the parse() method, etag is the second argument and modified is the third.

Instead of:

d2=feedparser.parse(feed,modified)

Do:

d2=feedparser.parse(feed,modified=modified)
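
To see why the positional call misfires, here is a minimal sketch using a stand-in function. parse_stub is hypothetical: it only copies the order of the first three parameters of feedparser.parse() (url, etag, modified) and makes no network request.

```python
def parse_stub(url, etag=None, modified=None):
    """Hypothetical stand-in that mirrors the leading parameter order
    of feedparser.parse(); it just reports which slot each value hit."""
    return {"etag": etag, "modified": modified}


last_modified = "Wed, 07 Nov 2012 19:27:48 GMT"

# Positional: the date is bound to the second parameter, etag.
positional = parse_stub("http://example.com/feed", last_modified)

# Keyword: the date is bound to modified, as intended.
keyword = parse_stub("http://example.com/feed", modified=last_modified)

print(positional)
print(keyword)
```

Passing both values by keyword (etag=..., modified=...) avoids depending on parameter order at all.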

After taking a look at the source code, it looks like the only thing that passing etag or modified to the parse() function does is send the corresponding headers, so that the server can return an empty response if nothing has changed. If the server does not support this, it just returns the full RSS feed. I would modify your code to check the date of each entry and ignore any entry whose date is not later than the latest date seen in the previous request:

import feedparser
rsslist=["http://skottieyoung.tumblr.com/rss", "http://mrjakeparker.com/feed/"]

def feed_modified_date(feed):
    # This is the Last-Modified value from the response headers.
    # Do not confuse it with the dates inside the feed itself: the server
    # may use a different timezone for the Last-Modified header than it
    # uses for the publish dates.
    return feed.get('modified')

def max_entry_date(feed):
    entry_pub_dates = (e.get('published_parsed') for e in feed.entries)
    entry_pub_dates = tuple(e for e in entry_pub_dates if e is not None)

    if len(entry_pub_dates) > 0:
        return max(entry_pub_dates)    

    return None

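def entries_with_dates_after(feed, date):
    # If no previous date is known, treat every entry as new.
    if date is None:
        return list(feed.entries)

    response = []

    for entry in feed.entries:
        published = entry.get('published_parsed')
        # Skip entries without a parsed date; None cannot be compared.
        if published is not None and published > date:
            response.append(entry)

    return response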

for feed_url in rsslist:
    print('--------%s-------' % feed_url)
    d = feedparser.parse(feed_url)
    print('feed length %i' % len(d.entries))

    if len(d.entries) > 0:
        # feedparser exposes the ETag on the result object, not on d.feed
        etag = d.get('etag')
        modified = feed_modified_date(d)
        print('modified at %s' % modified)

        d2 = feedparser.parse(feed_url, etag=etag, modified=modified)
        print('second feed length %i' % len(d2.entries))
        if len(d2.entries) > 0:
            print("server does not support etags or there are new entries")
            # perhaps the server does not support etags or last-modified
            # filter entries ourself

            prev_max_date = max_entry_date(d)

            entries = entries_with_dates_after(d2, prev_max_date)

            print('%i new entries' % len(entries))
        else:
            print('there are no entries')

This produces:

--------http://skottieyoung.tumblr.com/rss-------
feed length 20
modified at None
second feed length 20
server does not support etags or there are new entries
0 new entries
--------http://mrjakeparker.com/feed/-------
feed length 10
modified at Wed, 07 Nov 2012 19:27:48 GMT
second feed length 0
there are no entries
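
The conditional request described in this answer can be sketched without feedparser. If-None-Match and If-Modified-Since are the standard HTTP request headers for this, and 304 is the standard "Not Modified" status. The function names conditional_headers and feed_changed are made up for illustration.

```python
def conditional_headers(etag=None, modified=None):
    """Build the request headers for an HTTP conditional GET.

    Only validators that are actually known are included, so a first
    request (no etag, no modified) sends no conditional headers at all.
    """
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if modified:
        headers["If-Modified-Since"] = modified
    return headers


def feed_changed(status):
    """A 304 status means the server had nothing new to send."""
    return status != 304


print(conditional_headers(modified="Wed, 07 Nov 2012 19:27:48 GMT"))
print(feed_changed(304))
```

A server that ignores these headers simply answers 200 with the full feed, which is why the date filtering in the code above is still needed as a fallback.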
Crutcher answered 8/11, 2012 at 21:28 Comment(4)
I guess I was unclear in my problem description. If you run my code you won't get back an etag. Therefore I tried the second method using the modified tag. This however doesn't seem to get me the desired result either. The documentation seems to show that I'm not getting these tags from the server. I am guessing modified is a part of the rss. The docs on etags seem to say that etags come in http header. So I guess the etag isn't being sent? - Paredes
The server for your first URL is not sending an etag, the second one is. You can check by opening the URL in your browser and looking at the response headers. - Crutcher
I've updated my response, I think that should get you started. - Crutcher
Thank you ever so much for your time. - Paredes

I would suggest using the Date response header as a fallback if the feed provides no etag or modified information.

It is available as feed['headers']['Date'] and can be used like this:

feedparser.parse(url, modified=feed['headers']['Date'])

Edit: It looks like some servers ignore the modified parameter, though.
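
A rough sketch of that fallback order, written against a plain dict so it runs without a network connection. pick_validators and header_value are hypothetical helpers, and the case-insensitive lookup is an assumption on my part, in case the header key is stored as 'date' rather than 'Date'.

```python
def header_value(headers, name):
    """Look up a header by name, ignoring the casing of the key."""
    for key, value in headers.items():
        if key.lower() == name.lower():
            return value
    return None


def pick_validators(result):
    """Prefer etag/modified; fall back to the Date header if both are missing."""
    etag = result.get("etag")
    modified = result.get("modified")
    if etag is None and modified is None:
        modified = header_value(result.get("headers", {}), "Date")
    return {"etag": etag, "modified": modified}


# Stand-in for a previous parse result that has neither etag nor modified.
previous = {"headers": {"Date": "Wed, 21 Mar 2018 21:48:00 GMT"}}
print(pick_validators(previous))
```

The returned dict has the same key names as the keyword arguments used earlier, so it could be passed along as feedparser.parse(url, **pick_validators(previous)).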

Collinsworth answered 21/3, 2018 at 21:48 Comment(0)
