Mongo aggregate query results in less document with sorting
Asked Answered
C

0

2

I have faced a weird issue while querying one of our collections and aggregating the result out to another collection. I am querying unique users who have done some action and projecting the count of action performed per user in the aggregation query.

var result = db.getCollection('user_actions').aggregate(
[
  {"$match":{"createdAt":{"$gte":1585161000000,"$lt":1585247340000}}},
  {"$group":{"_id":{"accountId":"$user.id"},"count":{"$sum":1}}},
  {"$sort": {"count": -1}},
  {"$project":{"_id":0,"count":"$count","accountId":"$_id.accountId"}}
]
)
db.getCollection('winners').insert(result.toArray())
db.getCollection('winners').find({}).count()

The result of the above query inserts 12643 documents in the newly created collection.

I have also tried the query with allowDiskUse and cursor initial batch size value as well. But the stats of the collection as an output of the aggregation pipeline remained the same and neither am I getting any exceptions.

{
    "ns" : "data.winners",
    "size" : 1121979,
    "count" : 12639,
    "avgObjSize" : 88,
    "storageSize" : 4096,
    "capped" : false,
    "wiredTiger" : {...},
    "nindexes" : 1,
    "totalIndexSize" : 4096,
    "indexSizes" : {
        "_id_" : 4096
    },
    "ok" : 1,
    "operationTime" : Timestamp(1585321431, 41),
    "$gleStats" : {...},
    "lastCommittedOpTime" : Timestamp(1585321431, 41),
    "$configServerState" : {...},
    "$clusterTime" : {...   }
}

That was more of a plot setting, now, when I remove the sort query from the aggregation, it produces 26397 documents as a result. I could pick up some individual accounts from this result and verify that this is the correct result I should have got in the first place(sorted additionally). I have tried looking into MongoDB Limits and Thresholds but couldn't really find a relation to the observation.

Database version

db.version()
4.0.0

Shell details

mongo -version
MongoDB shell version v4.2.1
git version: edf6d45851c0b9ee15548f0f847df141764a317e
allocator: system
modules: none
build environment:
    distarch: x86_64
    target_arch: x86_64

Edit: Worth mentioning as discussed with Valijon in comments, using another stage in the aggregation pipeline to count the documents processed:

{"$count": "winners"}

results in the same value 26397 with and without the {"$sort": {"count": -1}} stage.


It doesn't look to me like a sort memory limit behaviour either since that would have thrown an error according to the documentation.

Carnes answered 28/3, 2020 at 5:26 Comment(15)
did you try to use $out stage directly in your aggregation query, instead of manually inserting result?Hierarchize
@Hierarchize Yes. I did but with the same outcome. As I mentioned, the difference was mainly observed due to the removal of sort, I suspect its a processing threshold or limit that is being reached somewhere.Carnes
Sometimes .count() doesn't perform predict and return results based on the collection’s metadata. Can you add $count as last stage for user_actions and (without $count) after you insert winners? It's better use $out which is faster (performed directly in MongoDB).Dodiedodo
@Dodiedodo It's not a matter of the .count producing incorrect results since I could extract the entire collection as a CSV and validate that the count reported is actually correct. Related to $out I had, in reality, made use of that stage in aggregation as well but it doesn't create a difference.Carnes
@Carnes can share your data so we can reproduce the issue?Dodiedodo
@Dodiedodo Not possible for privacy as well as the data size concerns. But if you're looking for any information around the query or operation do let me know. I would update the question with it.Carnes
Can you share what you get with $count for user_actions aggregation and after insert winnersDodiedodo
@Dodiedodo Edited the question, the $count stage used returns the correct value with and without sort stage. So it tends to process threshold/limitation as I was guessing previously as well.Carnes
Couldn't it be a problem of the statistics themselves? Why don't you try to validate the collection before getting its statistics.Toxicity
@Toxicity Good point. I have used mongoexport to dump the collection as a CSV and validated the content as well. Hence confident of the numbers that I've put up in the question.Carnes
Can you try putting this data into an updated mongo version and repeating the experiment? There were a couple of bugs fixed concerning sorting in pipelines or in general between 4.0.0 and 4.0.17, most notably this oneCounterstatement
@Counterstatement Thank you for the comment. To test that out in different versions would be a decent amount of effort and I am afraid I wouldn't have that much time in the near future. But I would keep that as a bookmark if there is no solution to the problem found in the meanwhile. The link that you shared states, ...find commands and aggregation pipelines do not guarantee the same sort order when at least one of the documents in a collection is missing at least one of the fields being sorted on., which makes me wonder if the query in question is eligible for the bug since I sort on grouped by count.Carnes
You dont necessarily have to test out every version. I would probably go for the latest 4.0.x release and if that changes nothing the latest overall stable release. You can spin up a new test instance pretty quick with docker and since it doesnt have to do anything more than that one aggregation you can just selectively dump out the collections that you need with mongodumpCounterstatement
@Counterstatement I too meant a single version upgrade. But if nothing else, I would try that as well on a later stage. The point is there is still no cause that I am aware of about this failure and that is what I was hoping for when I shared the content here. I am aware of the setup but it's not easy or a routine for me to upgrade a Mongo version in production(there are integrations). On the other hand, solve without sorting and in-memory processing has got me done with my job anyway.Carnes
@Carnes Any update about the issue? I'm having exact problem right now.Butters

© 2022 - 2024 — McMap. All rights reserved.