What is the Cassandra database schema used in Reddit? [closed]
Asked Answered
S

1

9

Reddit is currently migrating its database from PosgreSQL to Apache Cassandra. Does anybody know what database schema does Reddit use in Cassandra?

Strangury answered 30/4, 2012 at 8:19 Comment(4)
I don't, and I'm not sure anyone outside of Reddit does, but should it really matter? I expect quite a few people here on SO could help determine the schema that's right for your application.Guarani
Reddit published the code that powers the site on GitHub: github.com/reddit/reddit . I could search in the code and determine the schema from there. But I thought that is easier to ask here.Strangury
I glanced through the code, and I see something like two dozen different columnfamilies that get created and used in different ways. Is there an area in particular you're wondering about, or are you looking more for something like show schema output from cassandra-cli?Guarani
I want to design an application that uses trees of comments as in Reddit. So, since my application is similar, I am trying to get some inspiration from there.Strangury
S
-2

I also don't know the exact Reddit schema, but for what you want to archive, you are on the right way, saving a hierarchy of comments in a document based database instead of a relational database. I would recommend to keep one document for each root-comment, and then add all the children (and children of the children) to that comment.

In CouchDB and MongoDB you can store JSON documents directly. In Cassandra I would save the JSON as a String. So the data structure would be only

root-comments
{
    root-comment-id
    root-comment-json-string
}

and each root-comment-json-string would look like this:

{
comment : "hello world"
answers : 
[
    {
    comment : "reply to hello world"
    answers : 
    [
        {
        comment : "thanks for the good reply"
        answers : []
        },
        {
        comment : "yes that reply was indeed awesome"
        answers : []
        }

    ]
    }

]
}

additionally you might want to add a UserName, UserID, Timestamp, .... etc. to the structure of each comment.

This 'denormalized' structure will make make the queries very fast compared to a normalized relational stucture IF you have A LOT of data.

In any case you will have to take care of all the exceptions, that can happen when you implement such a system for a large user scale, eg. What happens if someone replies to comment A with comment B, but at the same time (or later) comment A is deleted.

If you search the internet for "cassandra hierarchical data" you find some other approaches, but they all go back to normalisation or they are not complete for a 'infinite' hierarchy.

Sprite answered 2/1, 2014 at 6:3 Comment(4)
The problem with the approach you describe is that any time a new comment is added you have to update the JSON, that is, parse it, merge the comment in it and then write it to Cassandra. Imagine a tree with thousands of comments. So this approach is cheaper for retrieval, but expensive when updating. The relational approach is the other way around, expensive when retrieving a comment tree and cheap when updating. I think the best approach is a hybrid one, store the most import comments as you describe and the least important in a relational manner.Strangury
@Calin-AndreiBurloiu Yes true. It is my understanding that such a comment system (like in reddit) has far more reads than updates. So my answer is exactly the correct solution.Sprite
And the parsin of the result can easily done in JavaScript on the client, since it is JSON.Sprite
Kind of a newbie to this and your post helped to explain the structure a lot. In this case, how would inserting a comment work? I imagine there's a method addComment(parentId,commentString) - how do we find that particular node to insert a new child?Gilman

© 2022 - 2024 — McMap. All rights reserved.