MongoDB 2.6 Index set up, query using $or, $in, with limit and sort

P

3

6

I have a somewhat complex query that is very critical to my application.

$cur = $col->find(
    array (
        '$or' => array(
            array('owner' => $my_id),
            array('owner' => array('$in' => $friends), 'perm.type' => array('$in' => array('P', 'F'))),
            array('owner' => array('$in' => $friends), 'perm.list' => $my_id)
        )
    )
)->limit(10)->skip(0)->sort(array('ca' => -1));

The intention is to find the first 10 posts, sorted by their create time in desc order, which are:

a). made by myself, or b). made by my friends with permission types of 'P' for public, or'F' for friends, or c). made by my friends which the permission list has specifically designated me as a viewer.

The variable $friends is an array of user ids who are friends with me. perm.type has a total of 4 values, which are 'P', 'F', 'S', 'C'. perm.list is an array of user ids who has permission to view this post.

The above query works as intended in filtering out the correct results. But I ran into problem creating effective indexes on them.

The indexes I have created for this query are:

$col->ensureIndex(array('owner' => 1, 'ca' => -1));
$col->ensureIndex(array('owner' => 1, 'perm.type' => 1, 'ca' => -1));
$col->ensureIndex(array('owner' => 1, 'perm.list' => 1, 'ca' => -1));

The first index is designed for the first part of the query criteria, the 2nd index is designed for the 2nd criteria, and the 3rd is design for the 3rd criteria, and is a multikey index.

A typical post would look like this:

{
    "_id": "...",
    "owner": "001",
    "perm": {
        "type": "P",
        "list": []
    },
    "msg": "Nice dress!",
    "ca": 1390459269
}

Another example:

{
    "_id": "...",
    "owner": "007",
    "perm": {
        "type": "C",
        "list": ["001", "005"]
    },
    "msg": "Nice day!",
    "ca": 1390837209
}

I know of the limitation that existed prior to MongoDB version 2.6, which prevents the indexes being used when combining $or with sort(). The issue according to this http://jira.mongodb.org/browse/SERVER-1205 should have been fixed in 2.6.

And sure enough, explain() now shows the use of my indexes, where it didn't before in 2.4. But when I ran the query, it is now much slower than when it didn't use any indexes. explain() showed the nscanned is way higher than expected. After some searching, I found this issue https://jira.mongodb.org/browse/SERVER-3310 which seems to explain the problem I am experiencing. But as the ticket stated, this issue should have been fixed in 2.5.5, so what is causing my problem here?

I have tried to set up different indexes, compounding them in different orders, even separating them up, checking to see if the new index intersection feature would help. But none worked.

Does anyone know what my problem here is?

Edit After more testing, observing, and thinking, I have narrowed downed the issue, and it is really using $in, limit() and sort() all together in one query that's causing the problem. Adding a top level '$or' just doubles this problem for each '$or' clause. I will explain my logic below:

I have refined the my indexes to the following:

$col->ensureIndex(array('owner._id' => 1, 'ca' => -1, 'perm.type' => 1));
$col->ensureIndex(array('perm.list' => 1, 'ca' => -1, 'owner._id' => 1))

The reasoning behind the first index is when I have millions of records, the query should start looking from the given set of user ids (friends) first to narrow down the choices. Then it goes through it in reverse chronological order of the records to check if each has the right permission type. The problem with this index is that the query optimizer has no idea of how many records it needs to scan in order to satisfy the limit(10) condition. It has no idea where the 10 most recent records will eventually come from, so it has to return up to 10 records from each id specified in the '$in' clause, then repeat the same thing for each '$or'. So if I have two '$or' clauses, each with an '$in' that consist of 100 user ids, then it will have to scan enough records to match 10 records from each of the users in the '$in' of the first '$or', then and 10 records from each of the users in the '$in' of the 2nd '$or', giving a return of 2000 records (this is the n returned in explain, and nscanned will be much higher depending on how many records it needs to scan to find the 2000 matches), and from this 2000 records, all chronologically ordered already, it takes the top 10 to return.

So, what if I build the index in the following order: "'ca' => -1, 'owner._id' => 1, 'perm.type' => 1"? Well, I can't really do that, because when I have hundreds of thousands of users, with millions of records, most records will be irrelevant to the viewer. So if I start from 'ca' => -1 first, it will scan a lot of irrelevant records before hitting one that fits the criteria, even though each hit that it founds will count directly against the limit(10), and it will only need to scan as many records as it needs to match 10 records that meet the criteria. But this scan can be tens of thousands of records, or even more. Worst yet, if 10 records can't be found, it will have to go through the entire index to find this out.

The 2nd index is to look at each record that is designated for me, go through it in reverse chronological order, and check and see if these come from my friends. This is pretty straight forward, and the problem here really is from the combination of using this, with '$in', limit() and sort() from above, all together in one query.

At this point, I am looking into solutions from merging results on the application side, but breaking up the '$or' to do on the application side is easy, but how do I break up the '$in' in the criteria array('owner' => array('$in' => $friends), 'perm.type' => array('$in' => array('P', 'F')))?

Penury answered 24/4, 2014 at 8:19 Comment(3)

A small note, you can remove the skip(0) from your query. Can you add explain() output? – Kimberelykimberlee 24/4, 2014 at 8:56

The skip is actually there for pagination. The whole explain is extremely long (spams over 20 pages), however, I will try to paste some of it into the question. – Penury 24/4, 2014 at 9:12

From your explain - your testing collection has only 60 documents? – Kimberelykimberlee 24/4, 2014 at 9:44

P

0

After 3 days of testing and research, the reason that is causing the inefficient queries is now clear. MongoDB at current version (2.6.1) is still unable to optimize queries that uses $or, $in, limit() and sort() all at once. The https://jira.mongodb.org/browse/SERVER-1205 and https://jira.mongodb.org/browse/SERVER-3310 fixes, each only improved performance on queries having 3 out of the 4 operations listed above. When introducing a 4th operation into the query, the optimization goes out the window. This behavior is observed with full index and document scans within the $or clause, even though limit(10) is specified.

The attempt to solve this problem by breaking up the $or clauses individually and merge results on the application side, while feasible, ran into major obstacles when I attempted to implement pagination.

My current solution thus, is to come up with an equivalent query to the original query, while only using 3 out of the 4 operations. I decided to 'flatten' the '$in' operator, turn each element within the $friends array into another '$or' condition with an exact owner value to be queried for. So instead of having 3 '$or' conditions in my original query, I now have as many '$or' conditions as I have elements in my $friends array, plus the 2 other original '$or' conditions.

The query is now optimized. When ran with explain(), the nscannedObjects, and nscanned is now way down, to values they are suppose to be. Considering the documentation on '$or' stating

When using indexes with $or queries, each clause of an $or will execute in parallel. These clauses can each use their own index.

This may actually be an acceptable solution performance-wise. I hope this will helps anyone who ran into the same problems I did.

Penury answered 26/4, 2014 at 7:46 Comment(0)

K

1

I'm not sure if this is a bug in MongoDB 2.6 but you can take a look at this article about index creation.

The order of fields in an index should be:

1. First, fields on which you will query for exact values.
2. Second, fields on which you will sort.
3. Finally, fields on which you will query for a range of values.

So following that advice, you can try with this indexes:

$col->ensureIndex(array('owner' => 1, 'ca' => -1));
$col->ensureIndex(array('ca' => -1, 'owner' => 1, 'perm.type' => 1));
$col->ensureIndex(array('perm.list' => 1, 'ca' => -1, 'owner' => 1));

Edit:

From your explain, if you're testing on small data sets, full collection is fast because MongoDB doesn't need to go through a lot of documents. You should try to do a test with e.g 10000 documents to see a real difference. Values for your fields in indexes should be different enough to ensure index selectivity for your queries (e.g. not all documents are from the same owner).

Kimberelykimberlee answered 24/4, 2014 at 9:26 Comment(1)

Thank you for your advice, I will try that. I am still testing different order and combinations of the compound indexes, and I am getting different results. When I run the query as one user, query optimizer would pick one index, and when I run it as a different user, it would prefer a different index, even though the queries are the same. – Penury 24/4, 2014 at 11:28

S

0

TL;DR: I believe you are using the wrong algorithm/data structure for the tool, or vice-versa. I'd suggest using a fan-out approach as discussed in this SO question, or my blog post. Sorry for shamelessly advertising my previous posts, but it doesn't make sense to repeat that info here.

The philosophy of MongoDB is, contrary to the typical SQL-philosophy, to be rather write-heavy. You are essentially trying to implement a ranking algorithm in a MongoDB query, but MongoDB's query philosophy is "query by example". That's not a good fit.

Sure, the aggregation pipeline is doesn't fit that philosophy anymore, and things maybe will change. There are optimizations on the way that allow for more complex queries, such as index intersection.

Still, what you are doing here is very hard to control. You not only want MongoDB to use index intersection (new in 2.6, only works with two indexes currently), but you're also combining it with $in queries and compound indexes. That's a lot to ask, and if the number of friends in the $in grows too much, you're all outta luck anyhow. The same is true if a piece of news is shared with too many people, worst case a document grows past 16MB. Growing documents is expensive, complex queries are expensive, large documents are expensive too.

I suggest you use a fan-out approach for newsfeeds where you can implement a very complex ranking algorithm in code, rather than in MongoDB.

I'm not saying it's impossible to optimize your query, but since explain's output is so ginourmous and there are so many effects interacting here (typical array sizes, typical match ratio, selectivity of the indexes, etc.), it will be very hard to find a good solution for this problem, even for someone who has full access to the data (i.e. you).

Even if you get this to work, you might run into critical problems if your access patterns change, the data changes, etc., so you'll be dealing with a fragile construct.

Seabolt answered 24/4, 2014 at 10:23 Comment(2)

Your blog was a nice read. Thank you for sharing it. The fan-out approach in your blog is interesting, but that also is dependent on your application's use case. For example, in my app, a lot of the feeds will be public, meaning everyone, even people without an account can read them. – Penury 24/4, 2014 at 11:22

Thanks for the feedback! I think the approach can be adjusted, it's definitely not copy & paste, but you could e.g. introduce a 'public' feed that ranks the most important news items for anonymous users, or by having anon users directly read from the actual Action collection. – Seabolt 24/4, 2014 at 11:38

P

0

After 3 days of testing and research, the reason that is causing the inefficient queries is now clear. MongoDB at current version (2.6.1) is still unable to optimize queries that uses $or, $in, limit() and sort() all at once. The https://jira.mongodb.org/browse/SERVER-1205 and https://jira.mongodb.org/browse/SERVER-3310 fixes, each only improved performance on queries having 3 out of the 4 operations listed above. When introducing a 4th operation into the query, the optimization goes out the window. This behavior is observed with full index and document scans within the $or clause, even though limit(10) is specified.