Reddit API returning useless JSON

I'm trying to scrape new stories from Reddit using their API and Python's urllib2, but I keep getting JSON documents like this one:

{ u'kind': u'Listing', u'data': { u'modhash': u'', u'children': [], u'after': None, u'before': None }}

Here is my code:

import json
import time
import urllib2

def get_submissions(after=None):
    url = 'http://reddit.com/r/all/new.json?limit=100'
    if after:
        url += '&after=%s' % after

    _user_agent = 'Reddit Link Analysis Bot by PirateLogic @ github.com/jamesbrewer'
    _request = urllib2.Request(url, headers={'User-agent': _user_agent})
    _json = json.loads(urllib2.urlopen(_request).read())   

    return [story for story in _json['data']['children']], _json['data']['after']

if __name__ == '__main__':
    after = None
    stories = []
    limit = 1
    while len(stories) < limit:
        new_stories, after = get_submissions(after)
        stories.extend(new_stories)
        time.sleep(2) # The Reddit API allows one request every two seconds.
        print '%d stories collected so far .. sleeping for two seconds.' % len(stories)

What I've written is fairly short and straightforward, but I'm obviously overlooking something, or I don't have a complete understanding of the API or of how urllib2 works.

Here's an example page from the API.

What's the deal?

EDIT After trying to load the example page in another browser, I'm also seeing the JSON I posted at the top of the page. It seems to happen only for /new.json though. If I try /hot.json or just /.json, I get what I want.

Chimera answered 11/11, 2012 at 5:18 Comment(7)
The API link provided gives me the same data, {"kind": "Listing", "data": {"modhash": "", "children": [], "after": null, "before": null}}. Are you sure you are using the API properly?Scleroprotein
Are you sure you aren't printing out your parsed JSON data?Mantelpiece
@Scleroprotein -- That's strange, because I get this. I had to change the limit to 10 because 100 was too large for pastie.Chimera
@Mantelpiece -- That comes from printing _json.Chimera
@JamesBrewer: I have no idea why, I haven't used this API before. Hopefully someone else can shed light on it.Scleroprotein
@JamesBrewer: Exactly; it parsed successfully, and you're printing the Python representation.Mantelpiece
@Scleroprotein -- I tried loading the page in another browser and now I'm getting that result as well. Maybe the API is down or something. It was working fine earlier.Chimera

Edit: As of 2013/02/22, the desired new sort no longer requires sort=new to be added as a URL parameter. This is because the rising sort is no longer provided under the /new route, but is provided by /rising [source].


The problem with the URL http://reddit.com/r/all/new.json?limit=100 is that the new pages use the rising sort by default. If you are logged in and have changed the default sort to new, then what you really see is the result for the page http://reddit.com/r/all/new.json?limit=100&sort=new. Notice the addition of the parameter sort=new.

Thus the result is correct; it is just that the rising view has not been updated for /r/all.
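
Applied to the code in the question, the fix is a one-line change to the URL. Here is a minimal sketch (same urllib2 approach as the question; the user agent value is a placeholder you should replace with your own):

import json
import urllib2

def get_submissions(after=None):
    # Request the new sort explicitly; without sort=new, the /new
    # pages fall back to the rising sort described above.
    url = 'http://reddit.com/r/all/new.json?limit=100&sort=new'
    if after:
        url += '&after=%s' % after

    request = urllib2.Request(url, headers={'User-agent': 'your unique user agent'})
    data = json.loads(urllib2.urlopen(request).read())['data']
    return data['children'], data['after']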

On a related note, I strongly suggest using PRAW (the Python Reddit API Wrapper) rather than writing your own code if you plan to use more than just a single part of the API. Here's the relevant code that you want:

import praw

# reddit requires a unique, descriptive user agent for every API client.
r = praw.Reddit('YOUR DESCRIPTIVE USER AGENT NAME')
# get_new_by_date() returns /r/all submissions sorted newest first.
listing = list(r.get_subreddit('all').get_new_by_date())
print listing

If you simply want to iterate over the submissions you can omit the list() part.
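
For example, a minimal sketch of the lazy form (this assumes the same PRAW version as above; submission.title is just one attribute you might print):

import praw

r = praw.Reddit('YOUR DESCRIPTIVE USER AGENT NAME')
# Iterate lazily instead of materializing the whole listing in memory;
# limit=None asks PRAW to keep fetching as far back as reddit allows.
for submission in r.get_subreddit('all').get_new_by_date(limit=None):
    print submission.title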

Seely answered 11/11, 2012 at 7:36 Comment(7)
How do I get around PRAW only returning 1000 results before stopping?Chimera
The 1000 item limit is a reddit limitation, not a PRAW limitation. The only exception (that I am aware of) is for the /r/all/new?sort=new listing shown above. I just confirmed that using r.get_subreddit('all').get_new_by_date(limit=2000) will in fact fetch 2000 items. Replace limit=2000 with limit=None to continue back to the beginning of reddit.Seely
I love you @bboe! I have been hitting my head against a wall for two days already trying to figure out why I can only get the first ~800 listings for a subreddit. Why could I only find this in a comment on SO, and not in the docs?!Menhaden
If you mean the 1000 item limit, it is mentioned in 2 places in the docs: praw.readthedocs.org/en/latest/pages/… and praw.readthedocs.org/en/latest/pages/…Seely
I didn't read the praw docs because I'm not using praw... Maybe I should be.. Thanks!Menhaden
Oh you meant the reddit API documentation. Perhaps I'll add it.Seely
A "Non-obvious behavior" addition to the Reddit API Docs would be most welcomed =)Menhaden

I was stumped on a similar problem (not the same as the OP's) for a while: no children in the API response. I figured I'd post this in case it's helpful to others getting to this question via a search engine.

If I open this URL in my browser:

https://www.reddit.com/comments.json?limit=100

It seems to work fine, but when I send a programmatic request it returns no children. I tried playing with the user agent of the request and other things like that, to no avail. I ended up using the /r/all comment stream instead:

https://www.reddit.com/r/all/comments.json?limit=100

This works fine both in the browser and via a programmatic request. I still have no idea why the first URL doesn't work.
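
For reference, a minimal sketch of such a programmatic request, borrowing the urllib2 approach from the question (the user agent value is a placeholder):

import json
import urllib2

url = 'https://www.reddit.com/r/all/comments.json?limit=100'
request = urllib2.Request(url, headers={'User-agent': 'your unique user agent'})
listing = json.loads(urllib2.urlopen(request).read())
# Each child wraps one comment; the comment text lives under data['body'].
for child in listing['data']['children']:
    print child['data']['body']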

Gunderson answered 8/3, 2018 at 23:6 Comment(0)

http://www.reddit.com/r/all.json?limit=100 returns meaningful data

http://reddit.com/r/all/new?limit=100 (no .json) says there are no items...

It looks like reddit doesn't use /new the way you think it does, so the problem is in your use of the API.

If this answer is not sufficient, please include a link to the reddit API docs.

Also, here's a quick note on REST. It looks like reddit is RESTful (I stand to be corrected, but that's what my experiments here tell me...). This means that dropping the .json extension from any of the URLs you are trying to access should give you a human-friendly version of the same data. This can be useful during testing: just look at the page in your browser and you will see what information reddit thinks you are asking for.
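
To illustrate that testing trick, here is a small sketch that requests the same path with and without the .json extension and prints the Content-Type reddit returns (the user agent value is a placeholder, and the exact header values may vary):

import urllib2

urls = ['http://www.reddit.com/r/all.json?limit=100',
        'http://www.reddit.com/r/all?limit=100']
for url in urls:
    request = urllib2.Request(url, headers={'User-agent': 'your unique user agent'})
    response = urllib2.urlopen(request)
    # The .json URL should come back as JSON; without the extension,
    # reddit serves the human-friendly HTML page for the same data.
    print url, '->', response.info().gettype()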

Bywoods answered 11/11, 2012 at 7:20 Comment(3)
Those two URLs point to completely different data. The first one is for "hot" posts in /r/all, and the second is for "new" posts.Gibber
No, the second one points to no posts at all... of course they are different.Bywoods
Yeah, I saw it. I answered before he did, so I hadn't seen it at the time. It is correct; I was pointing out that the source of the error was definitely not Python, and giving a tip for debugging such issues in the future.Bywoods
