I have a somewhat complex query that is very critical to my application.
$cur = $col->find(
array (
'$or' => array(
array('owner' => $my_id),
array('owner' => array('$in' => $friends), 'perm.type' => array('$in' => array('P', 'F'))),
array('owner' => array('$in' => $friends), 'perm.list' => $my_id)
)
)
)->limit(10)->skip(0)->sort(array('ca' => -1));
The intention is to find the first 10 posts, sorted by their create time in desc order, which are:
a). made by myself, or b). made by my friends with permission types of 'P' for public, or'F' for friends, or c). made by my friends which the permission list has specifically designated me as a viewer.
The variable $friends is an array of user ids who are friends with me. perm.type has a total of 4 values, which are 'P', 'F', 'S', 'C'. perm.list is an array of user ids who has permission to view this post.
The above query works as intended in filtering out the correct results. But I ran into problem creating effective indexes on them.
The indexes I have created for this query are:
$col->ensureIndex(array('owner' => 1, 'ca' => -1));
$col->ensureIndex(array('owner' => 1, 'perm.type' => 1, 'ca' => -1));
$col->ensureIndex(array('owner' => 1, 'perm.list' => 1, 'ca' => -1));
The first index is designed for the first part of the query criteria, the 2nd index is designed for the 2nd criteria, and the 3rd is design for the 3rd criteria, and is a multikey index.
A typical post would look like this:
{
"_id": "...",
"owner": "001",
"perm": {
"type": "P",
"list": []
},
"msg": "Nice dress!",
"ca": 1390459269
}
Another example:
{
"_id": "...",
"owner": "007",
"perm": {
"type": "C",
"list": ["001", "005"]
},
"msg": "Nice day!",
"ca": 1390837209
}
I know of the limitation that existed prior to MongoDB version 2.6, which prevents the indexes being used when combining $or with sort(). The issue according to this http://jira.mongodb.org/browse/SERVER-1205 should have been fixed in 2.6.
And sure enough, explain() now shows the use of my indexes, where it didn't before in 2.4. But when I ran the query, it is now much slower than when it didn't use any indexes. explain() showed the nscanned is way higher than expected. After some searching, I found this issue https://jira.mongodb.org/browse/SERVER-3310 which seems to explain the problem I am experiencing. But as the ticket stated, this issue should have been fixed in 2.5.5, so what is causing my problem here?
I have tried to set up different indexes, compounding them in different orders, even separating them up, checking to see if the new index intersection feature would help. But none worked.
Does anyone know what my problem here is?
Edit After more testing, observing, and thinking, I have narrowed downed the issue, and it is really using $in, limit() and sort() all together in one query that's causing the problem. Adding a top level '$or' just doubles this problem for each '$or' clause. I will explain my logic below:
I have refined the my indexes to the following:
$col->ensureIndex(array('owner._id' => 1, 'ca' => -1, 'perm.type' => 1));
$col->ensureIndex(array('perm.list' => 1, 'ca' => -1, 'owner._id' => 1))
The reasoning behind the first index is when I have millions of records, the query should start looking from the given set of user ids (friends) first to narrow down the choices. Then it goes through it in reverse chronological order of the records to check if each has the right permission type. The problem with this index is that the query optimizer has no idea of how many records it needs to scan in order to satisfy the limit(10) condition. It has no idea where the 10 most recent records will eventually come from, so it has to return up to 10 records from each id specified in the '$in' clause, then repeat the same thing for each '$or'. So if I have two '$or' clauses, each with an '$in' that consist of 100 user ids, then it will have to scan enough records to match 10 records from each of the users in the '$in' of the first '$or', then and 10 records from each of the users in the '$in' of the 2nd '$or', giving a return of 2000 records (this is the n returned in explain, and nscanned will be much higher depending on how many records it needs to scan to find the 2000 matches), and from this 2000 records, all chronologically ordered already, it takes the top 10 to return.
So, what if I build the index in the following order: "'ca' => -1, 'owner._id' => 1, 'perm.type' => 1"? Well, I can't really do that, because when I have hundreds of thousands of users, with millions of records, most records will be irrelevant to the viewer. So if I start from 'ca' => -1 first, it will scan a lot of irrelevant records before hitting one that fits the criteria, even though each hit that it founds will count directly against the limit(10), and it will only need to scan as many records as it needs to match 10 records that meet the criteria. But this scan can be tens of thousands of records, or even more. Worst yet, if 10 records can't be found, it will have to go through the entire index to find this out.
The 2nd index is to look at each record that is designated for me, go through it in reverse chronological order, and check and see if these come from my friends. This is pretty straight forward, and the problem here really is from the combination of using this, with '$in', limit() and sort() from above, all together in one query.
At this point, I am looking into solutions from merging results on the application side, but breaking up the '$or' to do on the application side is easy, but how do I break up the '$in' in the criteria array('owner' => array('$in' => $friends), 'perm.type' => array('$in' => array('P', 'F')))?
skip(0)
from your query. Can you add explain() output? – Kimberelykimberlee