Why is full text search of MongoDB shards directly much faster than going through the cluster manager (mongos) instance?
Asked Answered
A

1

16

I have been very unhappy with full text search performance in MongoDB so I have been looking for outside-the-box solutions. With a relatively small collection of 25 million documents sharded across 8 beefy machines (4 shards with redundancy) I see some queries taking 10 seconds. That is awful. On a lark, I tried a 10 second query to the shards directly, and it seems like the mongos is sending the queries to shards serially, rather than in parallel. Across the 4 shards I saw search times of 2.5 seconds on one shard and the other 3 shards under 2 seconds each. That is a total of less than 8.5 seconds, but it took 10 through mongos. Facepalm.

Can someone confirm these queries to shards are being run serially? Or offer some other explanation?

What are the pitfalls to querying the shards directly?

We are on 4.0 and the query looks like this:

db.items.aggregate(
[
   { "$match" : {
    "$text" : { "$search" : "search terms"}
      }
   }, 
   { "$project": { "type_id" : 1, "source_id": 1 } },
   { "$facet" : { "types" : [ { "$unwind" : "$type_id"} , { "$sortByCount" : "$type_id"}] , "sources" : [ { "$unwind" : "$source_id"} , { "$sortByCount" : "$source_id"}]}}
]
);

I made a mistake before, this is the query being sent that has the issue. And I talked to a MongoDB expert and was clued into a big part of what's going on (I think), but happy to see what others have to say so I can pay the bounty and make it official.

Anacoluthia answered 31/8, 2018 at 17:58 Comment(7)
What's your mongo version? And what's your query look like? Does it use another condition rather than $search, or additional sorting?Ize
Updated the description.Anacoluthia
Could you give us some details about how you have created the text index? How are your shards organized? Do you have replica set (how many)?Caliber
if know which queries are taking longer you should check whether these queries are using btreeCursor using db.collectionName.find({Query}).explain()Sortie
@ChrisSeline How do you measure your query time? I am pretty sure scatter-gather is running parallellyIze
As @bhagwanparge said, can you post the output of explain()Inlier
Updated the description, I made a mistake when grabbing the actual query used for this timing issue, that is now updated. To answer the other questions: 4 shards each with 2 replicas. Measured using Robo 3T's query time.Anacoluthia
P
2

Can someone confirm these queries to shards are being run serially? Or offer some other explanation?

Without a shard key in the query, the query is sent to all shards and processed in parallel. However, the results from all shards will be merged at the primary shard, and thus it'll wait until the slowest shard returns.

What are the pitfalls to querying the shards directly?

You can potentially include orphaned documents. Query via mongos also checks orphaned documents to ensure data consistency. Therefore, querying via mongos has more overhead than querying directly from each shard.

Measured using Robo 3T's query time

Using Robo 3T doesn't measure the query time correctly. By default, Robo 3T returns first 50 documents. For driver implementations, if the number of returned documents is more than the default batch size, to retrieve the all docs, there will be getmore requests followed to database. Robo 3T only gives you the first batch, i.e. a subset of results.

To evaluate your query, add explain('executionStats') to your query. The performance hit is likely the data transfer between shards. Because the lacking of a shard key in the query, the results of all shards have to be sent to a shard before merging. The total time is not only the query time (locating the docs) from mongo engine, but also documents retrieval time.

Execute the command below and you'll see inputStages from each shard to better evaluate your query.

db.items.explain('executionStats').aggregate(
[
   { "$match" : {
    "$text" : { "$search" : "search terms"}
      }
   }, 
   { "$project": { "type_id" : 1, "source_id": 1 } },
   { "$facet" : { "types" : [ { "$unwind" : "$type_id"} , { "$sortByCount" : "$type_id"}] , "sources" : [ { "$unwind" : "$source_id"} , { "$sortByCount" : "$source_id"}]}}
]
);
Pyrogenous answered 21/1, 2019 at 14:16 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.