Reddit API returning useless JSON

Asked 11/11, 2012 at 5:18 Answered 8/3, 2018 at 23:6

I'm trying to scrape new stories from Reddit using their API and Python's urllib2, but I keep getting JSON documents like this one:

{ u'kind': u'Listing', u'data': { u'modhash': u'', u'children': [], u'after': None, u'before': None }}

Here is my code:

import json
import time
import urllib2

def get_submissions(after=None):
    url = 'http://reddit.com/r/all/new.json?limit=100'
    if after:
        url += '&after=%s' % after

    _user_agent = 'Reddit Link Analysis Bot by PirateLogic @ github.com/jamesbrewer'
    _request = urllib2.Request(url, headers={'User-agent': _user_agent})
    _json = json.loads(urllib2.urlopen(_request).read())   

    return [story for story in _json['data']['children']], _json['data']['after']

if __name__ == '__main__':
    after = None
    stories = []
    limit = 1
    while len(stories) < limit:
        new_stories, after = get_submissions(after)
        stories.extend(new_stories)
        time.sleep(2) # The Reddit API allows one request every two seconds.
        print '%d stories collected so far .. sleeping for two seconds.' % len(stories)

What I've written is fairly short and straight-forward, but I'm obviously overlooking something or I don't have a complete understanding of the API or how urllib2 works.

Here's an example page from the API.

What's the deal?

EDIT After trying to load the example page in another browser, I'm also seeing the JSON I posted at the top of the page. It seems to be only for //new.json though. If I try //hot.json or just /.json, I get what I want.

Chimera answered 11/11, 2012 at 5:18 Comment(7)

The API link provided gives me the same data, {"kind": "Listing", "data": {"modhash": "", "children": [], "after": null, "before": null}}. Are you sure you are using the API properly? – Scleroprotein 11/11, 2012 at 5:33

Are you sure you aren't printing out your parsed JSON data? – Mantelpiece 11/11, 2012 at 5:34

@Scleroprotein -- That's strange, because I get this. I had to change the limit to 10 because 100 waste too large for pastie. – Chimera 11/11, 2012 at 5:48

@Mantelpiece -- That comes from printing _json. – Chimera 11/11, 2012 at 5:49

@JamesBrewer: I have no idea why, I haven't used this API before. Hopefully someone else can shed light on it. – Scleroprotein 11/11, 2012 at 5:56

@JamesBrewer: Exactly; it parsed successfully, and you're printing the Python representation. – Mantelpiece 11/11, 2012 at 6:2

@Scleroprotein -- I tried loading the page in another browser and now I'm getting that result as well. Maybe the API is down or something. It was working fine earlier. – Chimera 11/11, 2012 at 6:35

Edit: As of 2013/02/22, the desired new sort no longer requires sort=new to be added as a URL parameter. This is because the rising sort is no longer provided under the /new route, but is provided by /rising [source].

The problem with the URL http://reddit.com/r/all/new.json?limit=100 is that the new pages by default use the rising sort. If you are logged in, and you have changed the default sort to new then what you really see is the result for the page http://reddit.com/r/all/new.json?limit=100&sort=new. Notice the addition of the parameter sort=new.

Thus the result is correct, it is just that the rising view has not been updated for /r/all.

On a related note, I strongly suggest using PRAW (the python reddit API wrapper) rather than writing your own code if you plan to use more than just a single part of the API. Here's the relevant code that you want:

import praw
r = praw.Reddit('YOUR DESCRIPTIVE USER AGENT NAME')
listing = list(r.get_subreddit('all').get_new_by_date())
print listing

If you simply want to iterate over the submissions you can omit the list() part.

Seely answered 11/11, 2012 at 7:36 Comment(7)

How do I get around PRAW only returning 1000 results before stopping? – Chimera 11/11, 2012 at 15:11

The 1000 item limit is a reddit limitation not a PRAW limitation. The only exception (I am aware of) is for the /r/all/new?sort=new listing shown above. I just confirmed that using r.get_subreddit('all').get_new_by_date(limit=2000) will in fact fetch 2000 items. Replace limit=2000 with limit=None to continue back to the beginning of reddit. – Seely 11/11, 2012 at 17:43

I love you @bboe! I have been hitting my head against a wall for two days already trying to figure out why I can only get the first ~800 listings for a subreddit. Why could I only find this in a comment on SO, and not in the docs?! – Menhaden 21/3, 2013 at 2:56

If you mean the 1000 item limit, it is mentioned in 2 places in the docs: praw.readthedocs.org/en/latest/pages/… and praw.readthedocs.org/en/latest/pages/… – Seely 21/3, 2013 at 18:10

I didn't read the praw docs because I'm not using praw... Maybe I should be.. Thanks! – Menhaden 21/3, 2013 at 19:29

Oh you meant the reddit API documentation. Perhaps I'll add it. – Seely 21/3, 2013 at 23:32

A "Non-obvious behavior" addition to the Reddit API Docs would be most welcomed =) – Menhaden 22/3, 2013 at 1:4

I was stumped on a similar (not the same as OP) problem for a while - no children in the API response. I figured I'd post this in case it's helpful to others getting to this question via a search engine:

If I open this url in my browser:

https://www.reddit.com/comments.json?limit=100

It seems to work fine, but when I send a request it returns no children. Tried playing with the user-agent of the request and stuff like that to no avail. Ended up using the /r/all comment stream instead:

https://www.reddit.com/r/all/comments.json?limit=100

Works fine in the browser and via a programmatic request. Still have no idea why the first url doesn't work.

Gunderson answered 8/3, 2018 at 23:6 Comment(0)

-1

http://www.reddit.com/r/all.json?limit=100 returns meaningful data

http://reddit.com/r/all/new?limit=100 (no .json) says there are no items...

It looks like reddit doesn't use /new how you think it does so the problem is in your use of the api.

If this answer is not sufficient please include a link to the reddit api docs.

Also, here's a quick note on REST. It looks like reddit is RESTful (I stand to be corrected but that's what my experiments here tell me...). This means that by dropping the .json extension on any of the urls you are trying to access should give you human-friendly versions of the same data. This could be useful during testing. Just look at stuff with your browser and you will see what info reddit thinks you are asking for.

Bywoods answered 11/11, 2012 at 7:20 Comment(3)

Those two URLs point to completely different data. The first one is for "hot" posts in /r/all, and the second is for "new" posts. – Gibber 11/11, 2012 at 10:6

no. the second one points to no posts at all... of course they are different. – Bywoods 11/11, 2012 at 10:44

yeah. I saw it. I answered before him so I didn't see it. It is correct, I was pointing out that the source of the error was definitely not Python and giving a tip for debugging such issues in the future. – Bywoods 11/11, 2012 at 15:36

Recommended topics

Hot tags