I'm trying to scrape new stories from Reddit using their API and Python's urllib2, but every response I get back is an empty listing like this one (no children, no after token):
{ u'kind': u'Listing', u'data': { u'modhash': u'', u'children': [], u'after': None, u'before': None }}
Here is my code:
import json
import time
import urllib2


def get_submissions(after=None):
    url = 'http://reddit.com/r/all/new.json?limit=100'
    if after:
        url += '&after=%s' % after

    _user_agent = 'Reddit Link Analysis Bot by PirateLogic @ github.com/jamesbrewer'
    _request = urllib2.Request(url, headers={'User-agent': _user_agent})
    _json = json.loads(urllib2.urlopen(_request).read())

    return [story for story in _json['data']['children']], _json['data']['after']


if __name__ == '__main__':
    after = None
    stories = []
    limit = 1
    while len(stories) < limit:
        new_stories, after = get_submissions(after)
        stories.extend(new_stories)
        time.sleep(2)  # The Reddit API allows one request every two seconds.
        print '%d stories collected so far .. sleeping for two seconds.' % len(stories)
What I've written is fairly short and straightforward, but I'm obviously overlooking something, or I don't have a complete understanding of the API or of how urllib2 works.
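For what it's worth, here's the kind of sanity check I've been running to rule out a redirect or a non-200 response; the debug_fetch name is just something I made up for testing, not part of the script above:

import json
import urllib2

_user_agent = 'Reddit Link Analysis Bot by PirateLogic @ github.com/jamesbrewer'

def debug_fetch(url):
    # Make the same request as above, but show what actually comes back.
    request = urllib2.Request(url, headers={'User-agent': _user_agent})
    response = urllib2.urlopen(request)
    body = response.read()
    print 'final URL : %s' % response.geturl()    # check whether we were redirected
    print 'HTTP code : %d' % response.getcode()   # should be 200
    data = json.loads(body)
    print 'children  : %d' % len(data['data']['children'])
    return data

debug_fetch('http://reddit.com/r/all/new.json?limit=100')

It always reports a 200 and the original URL, just with zero children.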
Here's an example page from the API.
What's the deal?
EDIT: After trying to load the example page in another browser, I'm also seeing the same JSON I posted at the top:

{"kind": "Listing", "data": {"modhash": "", "children": [], "after": null, "before": null}}

It only seems to happen for //new.json, though. If I try //hot.json or just /.json, I get what I want.
Are you sure you are using the API properly? – Scleroprotein