Select top N rows from each group

I use mongodb for my blog platform, where users can create their own blogs. All entries from all blogs are in an entries collection. The document of an entry looks like:

{
  'blog_id':xxx,
  'timestamp':xxx,
  'title':xxx,
  'content':xxx
}

As the question says, is there any way to select, say, the last 3 entries for each blog?

Haphazard answered 27/6, 2011 at 20:22 Comment(0)

The only way to do this in basic mongo is if you can live with two things:

  • An additional field in your entry document, let's call it "age"
  • A new blog entry taking an additional update

If so, here's how you do it:

  1. Upon creating a new entry, do your normal insert and then execute this update to increase the age of all posts for this blog (including the one you just inserted); a combined sketch follows after these steps:

    db.entries.update({blog_id: BLOG_ID}, {$inc: {age: 1}}, false, true)

  2. When querying, use the following query which will return the most recent 3 entries for each blog :

    db.entries.find({age:{$lte:3}, timestamp:{$gte:STARTOFMONTH, $lt:ENDOFMONTH}}).sort({blog_id:1, age:1})

Note that this solution is actually concurrency safe (no entries with duplicate ages).
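
For reference, here is a minimal sketch of the combined write path from step 1, using the current shell helpers insertOne/updateMany in place of the legacy update call (field values are illustrative):

// New entries start at age 0; the $inc then bumps every entry of this
// blog, including the new one, so the newest post ends up with age 1.
db.entries.insertOne({
  blog_id: BLOG_ID,
  timestamp: new Date(),
  title: '...',
  content: '...',
  age: 0
})
db.entries.updateMany({blog_id: BLOG_ID}, {$inc: {age: 1}})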

Chandachandal answered 28/6, 2011 at 13:53 Comment(5)
Got your idea. I didn't think of anything like that. An additional update when creating a new post wouldn't be a problem. However, when a user deletes a post we'll have to update the 'age' field of all other posts. That update can be limited to happen only when the deleted post has 'age' <= 3 though. Am I missing anything?Haphazard
Yes, you shouldn't limit that update to age < 3 because you'll end up with duplicate ages. In-place updates are extremely fast so it shouldn't be a problem. A delete means deleting the entry and decreasing the age by 1 where age > deleted_post.age. Good luck.Chandachandal
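A minimal sketch of that delete handling, assuming deleted_post holds the entry that was just removed (hypothetical variable, same legacy update style as the answer):

// Remove the entry, then close the gap so ages stay dense and unique.
db.entries.remove({_id: deleted_post._id})
db.entries.update(
  {blog_id: deleted_post.blog_id, age: {$gt: deleted_post.age}},
  {$inc: {age: -1}},
  false, true  // upsert: false, multi: true
)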
It is good for a small number of records with rare updates, but is it effective with a messaging system, where I need to get the last message from each conversation between 2 users and I have thousands of messages with many new messages every minute? I think it isn't efficient to do an update of "age" for thousands of messages each time. Could you advise something for that case?Renault
@oyatek Depends a bit on your exact use case and the read/write ratios. If you open a question with your specific problem I'll have a look.Chandachandal
Yes, the question is here - #9860213 - (I already marked it as answered but I'll appreciate your answer)Renault

You need to first sort the documents in the collection by the blog_id and timestamp fields, then do an initial $group which pushes the original documents, now in descending timestamp order, into an array. After that you can $slice that array to return only the first 3 elements.

The intuition can be followed in this example:

db.entries.aggregate([
    { '$sort': { 'blog_id': 1, 'timestamp': -1 } }, 
    {       
        '$group': {
            '_id': '$blog_id',
            'docs': { '$push': '$$ROOT' },
        }
    },
    {
        '$project': {
            'top_three': { 
                '$slice': ['$docs', 3]
            }
        }
    }
])
Dispassionate answered 7/1, 2017 at 22:25 Comment(1)
If there are several thousand documents in each group, I guess the group stage will keep them all in the docs array, while we just want the last 3 and don't have to keep anything else. Do you know if there is a way to make that more efficient (keeping at most 3 documents in docs) in Mongo 4.2? (I guess in 4.4 you could use a custom accumulator function.)Longhorn
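
On the comment above: from MongoDB 4.4 you could indeed cap the array with a custom $accumulator (server-side JavaScript must be enabled). A hedged, untested sketch, reusing the $sort stage from the answer so documents reach each group already in descending timestamp order:

db.entries.aggregate([
    { '$sort': { 'blog_id': 1, 'timestamp': -1 } },
    {
        '$group': {
            '_id': '$blog_id',
            'top_three': {
                '$accumulator': {
                    'init': function () { return []; },
                    'accumulateArgs': ['$$ROOT'],
                    // Documents arrive pre-sorted, so keeping the first 3 is enough.
                    'accumulate': function (state, doc) {
                        state.push(doc);
                        return state.slice(0, 3);
                    },
                    // Partial states (e.g. from sharded input) are re-sorted and re-capped.
                    'merge': function (a, b) {
                        return a.concat(b)
                            .sort(function (x, y) { return y.timestamp - x.timestamp; })
                            .slice(0, 3);
                    },
                    'lang': 'js'
                }
            }
        }
    }
])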

Starting in Mongo 5.2, it's a perfect use case for the new $topN aggregation accumulator:

// { blog_id: "a", title: "plop",  content: "smthg" }
// { blog_id: "b", title: "hum",   content: "meh"   }
// { blog_id: "a", title: "hello", content: "world" }
// { blog_id: "a", title: "what",  content: "ever"  }
db.collection.aggregate([
  { $group: {
    _id: "$blog_id",
    messages: { $topN: { n: 2, sortBy: { _id: -1 }, output: "$$ROOT" } }
  }}
])
// {
//   _id: "a",
//   messages: [
//     { blog_id: "a", title: "what",  content: "ever" },
//     { blog_id: "a", title: "hello", content: "world" }
//   ]
// }
// {
//   _id: "b",
//   messages: [
//     { blog_id: "b", title: "hum", content: "meh" }
//   ]
// }

This applies a $topN group accumulation that:

  • takes for each group the top 2 (n: 2) elements
  • top 2, as defined by sortBy: { _id: -1 }, which in this case means by reversed order of insertion
  • and for each record pushes the whole record in the group's list (output: "$$ROOT") since $$ROOT represents the whole document being processed.
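
Applied to the question's schema, the same accumulator would look something like this (an untested sketch that sorts on timestamp instead of _id and keeps 3 entries per blog):

db.entries.aggregate([
  { $group: {
    _id: "$blog_id",
    last_three: { $topN: { n: 3, sortBy: { timestamp: -1 }, output: "$$ROOT" } }
  }}
])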
Othilia answered 20/12, 2021 at 9:2 Comment(0)

It's possible with group (aggregation), but this will require a full collection scan.

Do you really need exactly 3, or can you set a limit, e.g. max 3 posts from the last week/month?

Sammiesammons answered 27/6, 2011 at 21:42 Comment(1)
Ideally I want to select exactly 3, but max 3 posts from the last month can be good enough if I can't find a solution other than data denormalization. Could you give me an example of how this can be done, please? All the MongoDB map-reduce tutorials I've read only show how to calculate stats (aggregation)...Haphazard

This map-reduce answer by drcosta from another question did the trick:

In mongo, how do I use map reduce to get a group by ordered by most recent

mapper = function () {
  // Emit each document's score under its category key.
  emit(this.category, {top: [this.score]});
}

reducer = function (key, values) {
  // Collect all scores seen for the category, sort them numerically
  // in descending order, and keep only the top 3.
  var scores = [];
  values.forEach(
    function (obj) {
      obj.top.forEach(
        function (score) {
          scores[scores.length] = score;
      });
  });
  scores.sort(function (a, b) { return b - a; });
  return {top: scores.slice(0, 3)};
}

function find_top_scores(categories) {
  // Turn the per-category top lists into a single $or query
  // against the original collection.
  var query = [];
  db.top_foos.find({_id: {$in: categories}}).forEach(
    function (topscores) {
      query[query.length] = {
        category: topscores._id,
        score: {$in: topscores.value.top}
      };
  });
  return db.foo.find({$or: query});
}
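
For completeness, a minimal sketch of how these pieces might be wired together in the shell (the top_foos output collection and foo source collection match the snippet above; the category values are illustrative):

// Build the per-category top lists into "top_foos", then fetch the
// matching documents from the source collection.
db.foo.mapReduce(mapper, reducer, {out: 'top_foos'})
find_top_scores(['news', 'sports'])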
Avignon answered 25/12, 2011 at 3:49 Comment(0)
