How to scrape all subreddit posts in a given time period

I have a function to scrape all the posts in the Bitcoin subreddit between 2014-11-01 and 2015-10-31.

However, I can only extract about 990 posts, and they go back only as far as October 25. I don't understand what's happening. After reading https://github.com/reddit/reddit/wiki/API, I added a Sys.sleep of 15 seconds between requests, to no avail.

Also, I experimented with scraping from another subreddit (fitness), but it also returned around 900 posts.

library(jsonlite)  # fromJSON
library(dplyr)     # %>% and select

getAllPosts <- function() {
    url <- "https://www.reddit.com/r/bitcoin/search.json?q=timestamp%3A1414800000..1446335999&sort=new&restrict_sr=on&rank=title&syntax=cloudsearch&limit=100"
    extract <- fromJSON(url)
    posts <- extract$data$children$data %>%
        dplyr::select(name, author, num_comments, created_utc, title, selftext)
    after <- posts[nrow(posts),1]
    url.next <- paste0("https://www.reddit.com/r/bitcoin/search.json?q=timestamp%3A1414800000..1446335999&sort=new&restrict_sr=on&rank=title&syntax=cloudsearch&after=",after,"&limit=100")
    extract.next <- fromJSON(url.next)
    posts.next <- extract.next$data$children$data

    # execute while loop as long as there are any rows in the data frame
    while (!is.null(nrow(posts.next))) {
        posts.next <- posts.next %>% dplyr::select(name, author, num_comments, created_utc, 
                                    title, selftext)
        posts <- rbind(posts, posts.next)
        after <- posts[nrow(posts),1]
        url.next <- paste0("https://www.reddit.com/r/bitcoin/search.json?q=timestamp%3A1414800000..1446335999&sort=new&restrict_sr=on&rank=title&syntax=cloudsearch&after=",after,"&limit=100")
        Sys.sleep(15)
        extract <- fromJSON(url.next)
        posts.next <- extract$data$children$data
    }
    # created_utc is in epoch seconds; display it in UTC rather than the local time zone
    posts$created_utc <- as.POSIXct(posts$created_utc, origin = "1970-01-01", tz = "UTC")
    return(posts)
}

posts <- getAllPosts()

Does reddit have some kind of limit that I'm hitting?

Triplet answered 24/11, 2015 at 19:7 Comment(0)

Yes, all reddit listings (posts, comments, etc.) are capped at 1000 items; they're essentially just cached lists, rather than queries, for performance reasons.

To get around this, you'll need to do some clever searching based on timestamps.
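
One way to read "searching based on timestamps" is to split the year into smaller windows and run the same paginated search once per window, so that no single listing reaches the cap. The sketch below only builds the windows and the URLs. The helper names split_windows and build_search_url are made up for illustration, the 4-day width is an arbitrary choice, and it assumes the timestamp search syntax from the question is still accepted by the endpoint.

```r
# Split [start, end] (epoch seconds, both inclusive) into consecutive,
# non-overlapping windows of at most `width` seconds each.
split_windows <- function(start, end, width) {
  from <- seq(start, end, by = width)
  to <- pmin(from + width - 1, end)
  data.frame(from = from, to = to)
}

# Build the search URL for one window, using the same query parameters as
# the question. `after` is the fullname of the last post on the previous page.
build_search_url <- function(subreddit, from, to, after = NULL) {
  url <- paste0(
    "https://www.reddit.com/r/", subreddit, "/search.json",
    "?q=timestamp%3A", format(from, scientific = FALSE),
    "..", format(to, scientific = FALSE),
    "&sort=new&restrict_sr=on&rank=title&syntax=cloudsearch&limit=100"
  )
  if (!is.null(after)) url <- paste0(url, "&after=", after)
  url
}

# 4-day windows covering 2014-11-01 through 2015-10-31
windows <- split_windows(1414800000, 1446335999, 4 * 24 * 3600)
first_url <- build_search_url("bitcoin", windows$from[1], windows$to[1])
```

Each row of windows would then be paged through with the after parameter exactly as in the question's loop. If a window could still hold more than 1000 posts, it needs to be narrower.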

Pasteurizer answered 24/11, 2015 at 23:48 Comment(2)
If I write a function with a loop where each iteration captures 4 days' worth of data, will that bypass Reddit's restrictions? In other words, will I be able to run this function and get the posts for the entire year? - Triplet
That depends on whether those 4 days are likely to contain more than 1000 posts. It would probably be easier to sort by new and use the timestamp of the last post you can access as the upper bound of the next search. - Defant
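
The suggestion in the last comment can be sketched as a loop that moves the upper timestamp bound down to just below the oldest post retrieved so far. This is only an outline under assumptions: collect_by_timestamp and fetch_page are hypothetical names, and fetch_page(from, to) stands for the paginated search from the question, returning a data frame with a created_utc column (or zero rows when nothing is left).

```r
# Repeatedly query [from, to], then lower `to` below the oldest post seen.
# fetch_page is a hypothetical helper supplied by the caller; it is expected
# to return at most about 1000 posts, newest first, for the given range.
collect_by_timestamp <- function(fetch_page, from, to) {
  batches <- list()
  repeat {
    batch <- fetch_page(from, to)
    if (is.null(batch) || nrow(batch) == 0) break
    batches[[length(batches) + 1]] <- batch
    # Next search ends one second before the oldest post in this batch.
    # Posts sharing that exact second but cut off by the cap would be missed.
    to <- min(batch$created_utc) - 1
    if (to < from) break
  }
  if (length(batches) == 0) return(data.frame())
  do.call(rbind, batches)
}
```

Passing the fetcher in as an argument keeps the loop independent of the network code, so it can be tried against a fake fetch_page before pointing it at reddit. The Sys.sleep between requests would live inside fetch_page.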
