MongoDB Schema Design - Voting on Posts
Asked Answered
S

2

8

Consider I have a website where I've got a bunch of articles and people can vote on the articles they like.

I want to be able to query to get the articles with the most votes within a certain time (last hour, last day, last week) ordered by the number of votes.

As usual with MongoDB there are several different ways to implement this, but I am not sure which one is correct.

  • A posts document, which contains an Array of Votes - Votes themselves are documents containing userid, username and vote date:
    {
    "_id": "ObjectId(xxxx)",
    "title": "Post Title",
    "postdate": "21/02/2012+1345",
    "summary": "Summary of Article",

    "Votes": [
        {
            "userid":ObjectId(xxxx),
            "username": "Joe Smith",
            "votedate": "03/03/2012+1436"
        },
            ]
     }
  • A separate votes collection, containing details of an individual vote and a reference to the post that was voted on:
{
    "_id": "ObjectId(xxxx)",
    "postId": ObjectId(xxxx),
    "userId": ObjectId(xxxx),
    "votedate": "03/03/2012+1436"
}

The first one is more Documentey but I have no idea how to query the votes array to get the documents with the most votes in the last 24 hours.

I'm leaning towards the second one as it would be easier to query the vote count grouped by vote I think, but I'm not sure how well it would perform. This is how you'd do it in Relational Databases, but it doesn't seem very documenty - but I'm not sure if its a problem, is it?

Or do I use a combination of the two? Also would I do this type of aggregate query in real-time, every page load. Or do I just run the query say once per minute and store the results in a query result collection?

How would you implement this schema?

Shag answered 3/7, 2012 at 13:50 Comment(2)
related question: #9297293Arana
Also related: #7046962Tobias
A
10

The common way to track counts for votes overall would be to keep the number of votes in the post document and to update it atomically when pushing a new value to the votes array.

Since it's a single update, you are guaranteed that the count will match the number of elements in the array.

If the number of aggregations is fixed and the site is very busy you could extend this paradigm and increment additional counters, like one for month, day and hour, but that could get out of hand very quickly. So instead you could use the new Aggregation Framework (available in 2.1.2 dev release, will be in production in release 2.2. It is simpler to use than Map/Reduce and it will allow you to do the calculations you want very simply especially if you take care to store your vote dates as ISODate() type.

Typical pipeline for aggregation query for top vote getters this month might look something like this:

today = new Date();
thisMonth = new Date(today.getFullYear(),today.getMonth());
thisMonthEnd = new Date(today.getFullYear(),today.getMonth()+1);

db.posts.aggregate( [
    {$match: { "Votes.votedate": {$gte:thisMonth, $lt:thisMonthEnd} } },
    {$unwind: "$Votes" },
    {$match: { "Votes.votedate": {$gte:thisMonth, $lt:thisMonthEnd} } },
    {$group: { _id: "$title", votes: {$sum:1} } },
    {$sort: {"votes": -1} },
    {$limit: 10}
] );

This limits the input to the pipeline to posts that have votes by matching vote dates to the month you are counting, "unwinds" the array to get one document per vote and then does a "group by" equivalent summing up all votes for each title (I'm assuming title is unique). It then sorts descending by number of votes and limits the output to first ten.

You also have the ability to aggregate votes by day (for example) for that month to see which days are most active for voting:

db.posts.aggregate( [
    {$match: { "Votes.votedate": {$gte:thisMonth, $lt:thisMonthEnd} } },
    {$unwind: "$Votes" },
    {$match: { "Votes.votedate": {$gte:thisMonth, $lt:thisMonthEnd} } },
    {$project: { "day" : { "$dayOfMonth" : "$Votes.votedate" }  } },
    {$group: { _id: "$day", votes: {$sum:1} } },
    {$sort: {"votes": -1} },
    {$limit: 10}
] );
Arana answered 3/7, 2012 at 16:13 Comment(8)
if you end up storing votes in their own collection rather than embedding in posts, then you wouldn't need the "$unwind" step, the rest of the aggregation would basically stay the same.Arana
note it's not an accident that I $match the desired month twice. First $match gets rid of posts that don't have any votes in the desired month, but second $match (after the $unwind) makes sure we only keep the votes that happened in that month before we count them up. The first $match is there to reduce the number of total documents we're feeding into the pipeline, it's not strictly necessary except for performance.Arana
I am trying to use the solution for referenced documents, but it's not working. In my test, I have 6 posts, and only one of them has a single vote. All others have no records in the posts_votes collection. If I run {$group: { _id: "$votes.post_id", votes: {$sum: 1 }}} I get one record back with an _id of null. If I change $votes.post_id to $title, it returns all 6 posts each with a single vote (should only be one post with a vote, all others have zero). Also tried it with adding $project array that makes a variable using $votes.post_idAffianced
Ask a new question. Comments aren't a good place to resolve this.Arana
Agreed. I had posted a question similar to this but not much traction. https://mcmap.net/q/1328258/-order-posts-by-most-votes-overall-last-month-etc-with-laravel-mongodb/239375Affianced
Great answer, but i am also confused. I am implementing a similar feature and I am considering to use a separate collection for votes (making the voting non-atomic) partly because of your answer here: askasya.com/post/largeembeddedarrays. You say in this answer that a "common" way is to have an embedded votes array, but would you say it is the recommended way?Ironworks
yes, so in real life you would not want to have the array growing unbounded so what you would likely implement would be a hybrid. Since immediate accurate counts are important for posts with small number of votes and less so for large number of votes you might keep only most recent X votes in the array and true up your counts asynchronously, or some variant of that. This example is more for showing atomic operation mechanics and probably isn't the best example for how to do things in "real world" unless number of voters is realistically limited by other real world factors .Arana
thanks for pointing this out, maybe I'll revisit this in a write-up on the blog.Arana
B
0

The schema you choose depends largely on your use-case..If you are expecting a lot of votes/comments and want to process them independently of the post they belong to, you might keep them in a separate collection with postID as the 'foriegn key'..However, if you want to load all the votes when you load a particular post and the votes in themselves don't have any meaning without the post that houses them, then go for the embedding (in your case, the first) approach.

Bacchic answered 3/7, 2012 at 14:27 Comment(1)
You can try mapreduce with the more documentey approach to query the votes array to get the documents with the most votes in the last 24 hours...Since mapreduce happens to be a heavy operation, it is better to run it only occasionally and use cached results.Bacchic

© 2022 - 2024 — McMap. All rights reserved.