Reddit search API not giving all results

import praw

def get_data_reddit(search):
    username = ""
    password = ""
    r = praw.Reddit(user_agent='')
    r.login(username, password, disable_warning=True)
    # Site-wide search with no limit; PRAW pages through the listing itself.
    posts = r.search(search, subreddit=None, sort=None, syntax=None, period=None, limit=None)
    title = []
    for post in posts:
        title.append(post.title)
    print(len(title))


search = "stackoverflow"
get_data_reddit(search)

Output: 953

Why the limitation?

1. The [documentation][1] mentions:

> We can at most get 1000 results from every listing, this is an upstream limitation by reddit. There is nothing we can do to go past this limit. But we may be able to get the results we want with the search() method instead.

Any workaround? I'm hoping there is some way to overcome this in the API. I wrote a scraper for Twitter data and found it to be not the most efficient solution.

Same question: https://github.com/praw-dev/praw/issues/430 (please refer to the aforementioned link for related discussion too).

[1]: https://praw.readthedocs.org/en/v2.0.15/pages/faq.html

Industrious answered 23/6, 2015 at 10:56 Comment(2)
This is a relatively commonplace practice for APIs, to stop people from overloading the servers with requests. You can normally get around it by making your search queries more specific and looping through a defined set, e.g. keep the queries to a specific day and loop through the last ten days, or whatever reddit will allow that can work in this way. – Chavis
@Scironic Thanks! This seems a much better solution than a scraper. Can you provide an example to elucidate? It would be greatly helpful. Maybe going through from when reddit started to the current time. – Industrious

Limiting results on a search or list is a common tactic for reducing load on servers. The reddit API is clear that this is what it does (as you have already flagged). However, it doesn't stop there...

The API also supports a variation of paged results for listings. Since it is a constantly changing database, they don't provide fixed pages, but instead allow you to pick up where you left off by using the 'after' parameter. This is documented at reddit.com/dev/api#listings.
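To make the mechanics concrete, here is a minimal sketch of that pagination loop against reddit's public JSON search endpoint, independent of PRAW. The endpoint and the 'q', 'limit' and 'after' parameters follow reddit's listing documentation; the requests library and the loop structure are illustrative only:

import requests

def search_all(query, user_agent='listing-paging example'):
    """Yield every title the search listing will hand back,
    one page at a time, using the 'after' cursor."""
    headers = {'User-Agent': user_agent}
    after = None
    while True:
        params = {'q': query, 'limit': 100}
        if after:
            params['after'] = after  # fullname of the last item seen
        resp = requests.get('https://www.reddit.com/search.json',
                            params=params, headers=headers)
        data = resp.json()['data']
        for child in data['children']:
            yield child['data']['title']
        after = data['after']
        if after is None:  # reddit signals the last page with a null 'after'
            break

for title in search_all('stackoverflow'):
    print(title)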

Now, while I'm not familiar with PRAW, I see that the reddit search API conforms to the listing syntax. I think you therefore only need to reissue your search, specifying the extra 'after' parameter (referring to your last result from the first search).

Having subsequently tried it out, it appears PRAW is genuinely returning you all the results you asked for.

As requested by OP, here's the code I wrote to look at the paged results.

import praw

def get_data_reddit(search, after=None):
    r = praw.Reddit(user_agent='StackOverflow example')
    params = {"q": search}
    if after:
        # 't3_' is reddit's fullname prefix for link (submission) objects.
        params["after"] = "t3_" + str(after.id)
    # Hit the site-wide search listing directly and let PRAW page it.
    posts = r.get_content(r.config['search'] % 'all', params=params, limit=100)
    return posts

search = "stackoverflow"
post = None
count = 0
while True:
    posts = get_data_reddit(search, post)
    new_in_page = 0
    for post in posts:
        print(str(post.id))
        count += 1
        new_in_page += 1
    print(count)
    if new_in_page == 0:
        # An empty page means we have paged past the last result.
        break
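For reference, the 't3_' prefix above is reddit's fullname type prefix for link (submission) objects, constructed as per reddit.com/dev/api#fullnames; each request asks for the page of results after the fullname of the last submission already seen, which is how the listing pagination picks up where it left off.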
Nanosecond answered 27/6, 2015 at 21:12 Comment(16)
Thanks for the reply! But can you tell me how to achieve this in Python? – Industrious
Looking at the PRAW code, you just need to add 'after=<full name>' to your existing call to search(), where <full name> is constructed as per reddit.com/dev/api#fullnames – Nanosecond
Thanks for the reply again! I am sorry, I am new to Python. I don't understand what you mean exactly. How can I do what you have advised? Can you please edit your answer to help me understand better? – Industrious
Can you please elucidate more? Should the full name refer to the link of the search URL? – Industrious
Can you please reply? I am still stuck. – Industrious
OK, so I had a quick play and found that paging works if you go to get_content instead. Then I found that search already does the paging for you. So then I repeated the search on reddit and paged to the end... Guess what? There were 953 results. The short answer is therefore that your search is complete. – Nanosecond
Can you provide some code on how you did the paging? I am not sure if it paged to the end. Check this reddit.com/… – Industrious
Please provide some code; it helps to cross-check and understand your methodology. – Industrious
Given that I've just been voted down twice (without any explanation), I'm not sure that this is worth following, but I'll have one last go... Your URL returns exactly the same results as the simple search, with just the entries relabeled to start at 1,000. The count is just a mechanism to provide a consistent view to the user and not a definitive index. – Nanosecond
I also think that there should be some explanation why this answer has been voted down. – Wreckage
Sorry for downvoting quickly, I just got very confused. Thanks for pointing out my mistake with the count variable. If you check the question, I had asked for all results; I sincerely doubt Reddit has fewer than 1000 posts containing the word StackOverflow. I quickly tried a few other popular search entries as well. It seems Reddit always returns fewer than a thousand results. – Industrious
But this doesn't answer my question completely. It seems using some other search engine like Google could be the only option. What do you suggest? – Industrious
Looks like the reddit API is deliberately restricted here... I can't see a way to guarantee breaking down an arbitrary search so as to find all posts containing any keyword. A site-restricted search on Google is probably therefore your best bet. – Nanosecond
@PeterBrittain Can you give an example please? I can't find it on the web. – Industrious
You're now asking a very different question, which has been answered before. I don't think there's anything else to be covered in your original question now, so we should close this trail. – Nanosecond
@PeterBrittain It's certainly a different methodology, but the question remains the same. It was my mistake to start with the wrong method at first. – Industrious

So I would simply loop through a predetermined set of search queries. I'm assuming period is a time period? I'm also not sure what the format for it would be, so the code below is largely made up, but you should get the gist.

In which case, it would be something like the following:

import praw

def get_data_reddit(search):
    username = ""
    password = ""
    r = praw.Reddit(user_agent='')
    r.login(username, password, disable_warning=True)
    title = []

    # Declare a set of time periods to limit each search query.
    # (Placeholder values; reddit's search is believed to accept
    # periods such as 'day', 'week', 'month', 'year' and 'all'.)
    periods = ('day', 'week', 'month', 'year')

    for period in periods:  # loop through the time points and query posts from each
        # Pass the current period so each query returns a limited slice.
        posts = r.search(search, subreddit=None, sort=None, syntax=None,
                         period=period, limit=None)

        for post in posts:
            title.append(post.title)  # and append as usual
    print(len(title))


search = "stackoverflow"
get_data_reddit(search)
Chavis answered 23/6, 2015 at 11:27 Comment(5)
Thanks! I am looking for something like: get the date of the last reddit post related to that query, then check each month/year from then until now, or some better way to calculate intervals. This seems more like pseudocode than actual code. – Industrious
For example: one starts from the time reddit started (23 June 2005) and then calculates the number of posts in the current year; if greater than 1000, you further divide into months, and so on. It would be a better solution, I suspect (see the sketch after these comments). – Industrious
I reckon reddit will receive more than a thousand posts in an hour, considering it can get around a million comments a day, so it would have to be cut down to quite small timescales; it would take a very large number of requests, and a long time, to calculate the number of posts in a year. – Chavis
As mentioned in the answer, this isn't meant to be working code; I don't know the reddit API and I don't have a clear understanding of what you want to do. But the best way of getting around data limits is by using a more specific query, and how you are going to do that is something you need to figure out, as I can't say. – Chavis
This seems to suggest something similar: reddit.com/r/redditdev/comments/30a7ap/… – Industrious
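Following up on the interval-splitting idea in the comments above: at the time of this question, reddit's search also accepted cloudsearch syntax, which reportedly supported unix-timestamp range queries. The sketch below is an untested illustration of the recursive bisection the comments describe; the 'timestamp:<start>..<end>' query format and the 900-result threshold are assumptions, while the search() signature is the one used in the question.

import praw

def search_interval(r, query, start, end, results):
    # Search one unix-timestamp slice (assumed cloudsearch format:
    # 'timestamp:<start>..<end>').
    cs_query = 'timestamp:%d..%d %s' % (start, end, query)
    posts = list(r.search(cs_query, subreddit=None, sort=None,
                          syntax='cloudsearch', period=None, limit=None))
    if len(posts) >= 900 and end - start > 3600:
        # The slice probably hit the ~1000-result listing cap:
        # bisect the interval and recurse into each half.
        mid = (start + end) // 2
        search_interval(r, query, start, mid, results)
        search_interval(r, query, mid + 1, end, results)
    else:
        results.extend(posts)

r = praw.Reddit(user_agent='timestamp bisection sketch')
results = []
# 23 June 2005 (reddit's launch, per the comment above) to 27 June 2015.
search_interval(r, 'stackoverflow', 1119484800, 1435363200, results)
print(len(results))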
