I have faced a weird issue while querying one of our collections and aggregating the result out to another collection. I am querying unique users who have done some action and projecting the count of action performed per user in the aggregation query.
var result = db.getCollection('user_actions').aggregate(
[
{"$match":{"createdAt":{"$gte":1585161000000,"$lt":1585247340000}}},
{"$group":{"_id":{"accountId":"$user.id"},"count":{"$sum":1}}},
{"$sort": {"count": -1}},
{"$project":{"_id":0,"count":"$count","accountId":"$_id.accountId"}}
]
)
db.getCollection('winners').insert(result.toArray())
db.getCollection('winners').find({}).count()
The result of the above query inserts 12643 documents in the newly created collection.
I have also tried the query with allowDiskUse and cursor initial batch size value as well. But the stats of the collection as an output of the aggregation pipeline remained the same and neither am I getting any exceptions.
{
"ns" : "data.winners",
"size" : 1121979,
"count" : 12639,
"avgObjSize" : 88,
"storageSize" : 4096,
"capped" : false,
"wiredTiger" : {...},
"nindexes" : 1,
"totalIndexSize" : 4096,
"indexSizes" : {
"_id_" : 4096
},
"ok" : 1,
"operationTime" : Timestamp(1585321431, 41),
"$gleStats" : {...},
"lastCommittedOpTime" : Timestamp(1585321431, 41),
"$configServerState" : {...},
"$clusterTime" : {... }
}
That was more of a plot setting, now, when I remove the sort
query from the aggregation, it produces 26397 documents as a result. I could pick up some individual accounts from this result and verify that this is the correct result I should have got in the first place(sorted additionally). I have tried looking into MongoDB Limits and Thresholds but couldn't really find a relation to the observation.
Database version
db.version() 4.0.0
Shell details
mongo -version
MongoDB shell version v4.2.1
git version: edf6d45851c0b9ee15548f0f847df141764a317e
allocator: system
modules: none
build environment:
distarch: x86_64
target_arch: x86_64
Edit: Worth mentioning as discussed with Valijon in comments, using another stage in the aggregation pipeline to count the documents processed:
{"$count": "winners"}
results in the same value 26397 with and without the {"$sort": {"count": -1}}
stage.
It doesn't look to me like a sort memory limit behaviour either since that would have thrown an error according to the documentation.
sort
, I suspect its a processing threshold or limit that is being reached somewhere. – Carnes$count
as last stage foruser_actions
and (without$count
) after you insertwinners
? It's better use$out
which is faster (performed directly in MongoDB). – Dodiedodo.count
producing incorrect results since I could extract the entire collection as a CSV and validate that thecount
reported is actually correct. Related to$out
I had, in reality, made use of that stage in aggregation as well but it doesn't create a difference. – Carnes$count
foruser_actions
aggregation and after insertwinners
– Dodiedodo$count
stage used returns the correct value with and withoutsort
stage. So it tends to process threshold/limitation as I was guessing previously as well. – Carnesvalidate
the collection before getting its statistics. – Toxicitymongoexport
to dump the collection as a CSV and validated the content as well. Hence confident of the numbers that I've put up in the question. – Carnes