Python Feedparser: How can I check for new RSS data?

I'm using the feedparser Python library to pull RSS data from a feed continuously. So far my code only fetches the RSS data once. Here's my code currently:

import feedparser

rssPR = feedparser.parse('http://www.prnewswire.co.uk/rss/consumer-technology/wireless-communications-news.rss')
rssDataList = []

for index, item in enumerate(rssPR.entries):
    rssDataList.append([item.published.encode('utf-8'), item.title.encode('utf-8')])

print(rssDataList[0])  # for debugging purposes
print(rssPR.modified)  # for testing purposes
  1. What can I add to my code so that it fetches new RSS data only if the feed has been modified?

  2. Let's say I have a list of 10 RSS items, and the RSS feed has been updated with 2 new RSS items. How can I add only those 2 items to the rssDataList I've created? I don't want to keep adding the same items to my database.

Proconsul answered 5/3, 2014 at 23:39 Comment(0)

To download the feed only when it has changed, you can use the HTTP ETag header, with Last-Modified as a fallback.

>>> feed.etag
'"6c132-941-ad7e3080"'
>>> feed.modified
'Fri, 11 Jun 2012 23:00:34 GMT'

You can pass them to feedparser.parse on the next call. If they are still the same (no changes), the response will have the status code 304 (Not Modified).

It boils down to this example:

import feedparser
url = 'http://feedparser.org/docs/examples/atom10.xml'

# first request
feed = feedparser.parse(url)

# store the ETag and Last-Modified values; either may be missing
last_etag = feed.get('etag')
last_modified = feed.get('modified')

# check if a new version exists
feed_update = feedparser.parse(url, etag=last_etag, modified=last_modified)

if feed_update.get('status') == 304:
    print('no changes')

Note: feed.etag and feed.modified are only present if the server sent the corresponding headers, so check that they exist before relying on them.

The feedparser library will automatically send the If-None-Match header with the provided etag parameter and If-Modified-Since with the modified value for you.

Source: Feedparser documentation about http and etag
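For repeated polling, the same idea can be wrapped in a small helper. This is only a sketch, not something feedparser provides: `fetch_if_changed` and its `parse` parameter are names made up for illustration. It relies only on `feedparser.parse` accepting `etag` and `modified`, and on the result being dict-like. It is written for Python 3.

```python
def fetch_if_changed(url, etag=None, modified=None, parse=None):
    """Return (feed, etag, modified); feed is None when the server answers 304.

    `parse` is a hypothetical hook so the helper can be exercised without
    network access; by default feedparser.parse is used.
    """
    if parse is None:
        import feedparser  # imported lazily so the helper can be tested offline
        parse = feedparser.parse

    feed = parse(url, etag=etag, modified=modified)

    # 'status' is missing when no HTTP request took place, hence .get().
    if feed.get('status') == 304:
        return None, etag, modified

    # Keep the previous validators if the server did not send new ones.
    return feed, feed.get('etag', etag), feed.get('modified', modified)
```

You would keep the returned etag and modified values between polls (in a variable, a file, or your database) and pass them back in on the next call. A result of None for the feed means there is nothing new to process.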

To clarify the question asked in the comments:
This requires the server to support at least one of those headers.

If the server supports neither, this approach does not work: you cannot tell whether the feed has changed without downloading it.

That means you have to download the feed every time and keep track of which entries you have already seen.
If you want to skip entries you have already seen (e.g. print only the new ones), you have to keep a list of seen entries anyway. Some feeds provide an id field for each entry, which you can use for that. Otherwise you have to get a bit creative and work out what identifies an entry in your specific feed.
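A minimal sketch of that bookkeeping, assuming entries are dict-like (as feedparser entries are) and that the id, or failing that the link plus title, stays stable for your feed. `entry_key` and `new_entries` are hypothetical helper names, not part of feedparser.

```python
def entry_key(entry):
    """Pick an identifier: the entry id if present, else (link, title)."""
    return entry.get('id') or (entry.get('link', ''), entry.get('title', ''))


def new_entries(entries, seen):
    """Return the entries whose key is not in `seen`, and record them as seen."""
    fresh = []
    for entry in entries:
        key = entry_key(entry)
        if key not in seen:
            seen.add(key)
            fresh.append(entry)
    return fresh
```

With `seen = set()` kept between polls, `new_entries(rssPR.entries, seen)` would return all 10 items on the first call and only the 2 added ones after the update. If the script restarts, `seen` has to be reloaded from wherever you store your items.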

Irrefutable answered 12/4, 2016 at 12:14 Comment(6)
What if neither the ETag nor Last-Modified exists? - Whisenhunt
@Whisenhunt Then you have to do the request every time. - Irrefutable
@Whisenhunt I tried this. But the updated feed returns the entire RSS list and not only the "new" ones. The question specifically asks: "Let's say I have a list of 10 RSS items, and the RSS feed has been updated with 2 new RSS items. How can I ONLY add those 2 items to the rssDataList I've created? I don't want to keep adding the same RSSs to my database." - Racial
I suggest comparing the last item in the db with the newest fetched item; if they are the same, there is no new item. - Whisenhunt
@DheerajMPai This answer addresses the first point, "check (...) if and only if the RSS has been modified". The second part isn't possible without data storage on your side or special server-side handling. - Irrefutable
@Irrefutable Oh ok. Cool. Just wanted to know if there is a function in feedparser that could do this on its own. Looks like there isn't. Thanks. - Racial

Well there are a lot of different ways to tackle this. One of the easiest IMO would be to generate a unique "hash" of the most recent entry. For example:

import hashlib
import feedparser

rssPR = feedparser.parse('http://www.prnewswire.co.uk/rss/consumer-technology/wireless-communications-news.rss')
rssDataList = []

# Generate an MD5 hash of the most recent item's link and title.
# hashlib needs bytes, so encode the combined text first.
latest = rssPR.entries[0]
lasthash = hashlib.md5((latest.link + latest.title).encode('utf-8')).hexdigest()

for item in rssPR.entries:
    rssDataList.append([item.published.encode('utf-8'), item.title.encode('utf-8')])

print(rssPR.modified)  # Thu, 06 Mar 2014 00:13:50 GMT
print(lasthash)  # 4167402f1ba2629fcc71003121aa1d25

Then if you do something like so:

rssCheck = feedparser.parse('http://www.prnewswire.co.uk/rss/consumer-technology/wireless-communications-news.rss')
latest = rssCheck.entries[0]
thishash = hashlib.md5((latest.link + latest.title).encode('utf-8')).hexdigest()

print(lasthash == thishash)  # True means the feed is up to date

This way, whenever you check the feed again, a different hash tells you it has been updated. That saves the headache of doing time/date comparisons.

Intubate answered 6/3, 2014 at 0:22 Comment(1)
Generating hashes will come back and bite you pretty aggressively, because a lot of data in a given entry will change over time without any real meaning. For example, many links include tracking codes, etc. Other feeds have a timestamp that's continuously updated every time you fetch the feed... - Howenstein
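One way to soften the tracking-code problem is to normalize the link before hashing it. The sketch below (Python 3) drops query parameters whose names start with `utm_`. That prefix list is an assumption, so adjust it to whatever your feed actually appends. `stable_link` and `entry_hash` are hypothetical helper names.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: tracking parameters in this feed start with one of these prefixes.
TRACKING_PREFIXES = ('utm_',)


def stable_link(link):
    """Return the link without tracking parameters and without a fragment."""
    parts = urlsplit(link)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if not k.lower().startswith(TRACKING_PREFIXES)]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ''))


def entry_hash(link, title):
    """MD5 of the normalized link and the title, as a hex string."""
    data = (stable_link(link) + '\n' + title).encode('utf-8')
    return hashlib.md5(data).hexdigest()
```

This only helps with link noise. If the feed rewrites titles or timestamps on every fetch, leave those fields out of the hash or fall back to the entry id where one exists.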
