How to efficiently carry out a priority-based $or query in MongoDB with 10 billion records?
Asked Answered
F

2

8

I have a MongoDdb collection of about 10 billion documents. I want to be able to search in the collection using an $or query, but give priority over the conditions.

Here is my existing code:

const prefixChar = 'A';
const latterword = 'XYZ';

await data.find({
  $or: [
    {
      prefix: prefixChar,
      number: { $exists: true },
      latter: latterword
    },
    {
      prefix: { $exists: true },
      number: { $exists: true },
      latter: latterword
    },
    {
      prefix: prefixChar,
      number: { $exists: true },
      latter: { $regex: `^${latterword[0]}${latterword[1]}` }
    },
    {
      prefix: prefixChar,
      number: { $exists: true },
      latter: { $regex: `^${latterword[0]}` }
    },
    {
      prefix: prefixChar,
      number: { $exists: true },
      latter: { $exists: true }
    }
  ]
}).limit(12);

I have also tried to execute the queries one by one. It takes too much time. I need the response time between 0 to 2000 milliseconds. I have already put an index on those three fields and tried parallel execution of a query in Python, but with billions of records, it's still too slow.

Fillender answered 21/8, 2024 at 17:19 Comment(1)
Not directly related to the question - note that there is a subtle difference between testing {$exists: true} and {$ne:null} See the different results from Ployground-ne-null and Playground-exists. These may also interact differently with indexing, and have performance implications at scale.Sylvester
I
3

You can achieve this with some Boolean reduction.

  1. For starters, since number: { $exists: true } is in all the criteria, separate that from $or clauses and use $and to include it. Like:

    data.find({
      $and: [
        { number: { $exists: true } },
        {
          $or: [
            {
              prefix: prefixChar,
              latter: latterword
            }
            ...  // all the other criteria here
          ]
        }
      ]
    })
    
  2. In the 3rd & 4th clauses, the difference is only

    latter: { $regex: `^${latterword[0]}${latterword[1]}` }
    

    vs

    latter: { $regex: `^${latterword[0]}` }
    

    So the 4th case already covers the 3rd case and is more broad, so the 3rd case can be removed. Now, your remaining query is this.

  3. For a regex to match, or for a string to match - as required by the remaining clauses - the field has to exist. So the exists requirements can be moved into the first number-exists check anyway.

    Now the you're left with these in the $or part:

    {
      $or: [
        {
          prefix: "prefixChar",
          latter: "latterword"
        },
        { latter: "latterword" },
        {
          prefix: "prefixChar",
          latter: { $regex: `^${latterword[0]}` }
        },
        { prefix: "prefixChar" }
      ]
    }
    
  4. Excluding the regex, that means either prefix matches or latter matches, or both but that's covered in the either case already:

    {
      $or: [
        { latter: "latterword" },
        {
          prefix: "prefixChar",
          latter: { $regex: `^${latterword[0]}` }
        },
        { prefix: "prefixChar" }
      ]
    }
    

So, finally it's:

data.find({
  $and: [
    {
      prefix: { $exists: true },
      number: { $exists: true },
      latter: { $exists: true }
    },
    {
      $or: [
        { latter: "latterword" },
        {
          prefix: "prefixChar",
          latter: { $regex: `^${latterword[0]}` }
        },
        { prefix: "prefixChar" }
      ]
    }
  ]
})

Also, you should sort when using limit; otherwise results will be in a random order every time you execute it.

Ioneionesco answered 21/8, 2024 at 18:6 Comment(2)
You are right; the query is better. I have used similar approaches but the thing is I need to fetch data on the basis of priority. Suppose with a query like { prefix: prefixChar, number: { $exists: true }, latter: latterword }, suppose the first query returns 7 records then I need those 7 records at the top. Then, I move to the next query. If the second query returns 5 records then the search stops and it returns the total of 12 records.Fillender
Then you have no choice but to do it with different queries or use $unionWith with the or clauses in the order you want.Ioneionesco
U
1

From the scale of the dataset, I am guessing it is likely to be some datalake collection that might have a high read-write ratio. i.e. read frequently but not updated frequently. If that is the case and your parameter does not change frequently, you may consider precomputing the priority and store each execution's results in a separate collection such that you can refer the same execution result in future.

db.collection.aggregate([
  {
    "$set": {
      priority: {
        $let: {
          vars: {
            prefixChar: "A",
            latterword: "XYZ"
          },
          "in": {
            "$switch": {
              "branches": [
                {
                  "case": {
                    "$and": [
                      {
                        "$eq": [
                          "$prefix",
                          "$$prefixChar"
                        ]
                      },
                      {
                        "$ne": [
                          {
                            "$ifNull": [
                              "$number",
                              null
                            ]
                          },
                          null
                        ]
                      },
                      {
                        "$eq": [
                          "$latter",
                          "$$latterword"
                        ]
                      }
                    ]
                  },
                  "then": 1
                },
                {
                  "case": {
                    "$and": [
                      {
                        "$ne": [
                          {
                            "$ifNull": [
                              "$prefix",
                              null
                            ]
                          },
                          null
                        ]
                      },
                      {
                        "$ne": [
                          {
                            "$ifNull": [
                              "$number",
                              null
                            ]
                          },
                          null
                        ]
                      },
                      {
                        "$eq": [
                          "$latter",
                          "$$latterword"
                        ]
                      }
                    ]
                  },
                  "then": 2
                },
                {
                  "case": {
                    "$and": [
                      {
                        "$eq": [
                          "$prefix",
                          "$$prefixChar"
                        ]
                      },
                      {
                        "$ne": [
                          {
                            "$ifNull": [
                              "$number",
                              null
                            ]
                          },
                          null
                        ]
                      },
                      {
                        $eq: [
                          0,
                          {
                            "$indexOfCP": [
                              "$latter",
                              {
                                "$substrCP": [
                                  "$$latterword",
                                  0,
                                  2
                                ]
                              }
                            ]
                          }
                        ]
                      }
                    ]
                  },
                  "then": 3
                },
                {
                  "case": {
                    "$and": [
                      {
                        "$eq": [
                          "$prefix",
                          "$$prefixChar"
                        ]
                      },
                      {
                        "$ne": [
                          {
                            "$ifNull": [
                              "$number",
                              null
                            ]
                          },
                          null
                        ]
                      },
                      {
                        $eq: [
                          0,
                          {
                            "$indexOfCP": [
                              "$latter",
                              {
                                "$substrCP": [
                                  "$$latterword",
                                  0,
                                  1
                                ]
                              }
                            ]
                          }
                        ]
                      }
                    ]
                  },
                  "then": 4
                },
                {
                  "case": {
                    "$and": [
                      {
                        "$eq": [
                          "$prefix",
                          "$$prefixChar"
                        ]
                      },
                      {
                        "$ne": [
                          {
                            "$ifNull": [
                              "$number",
                              null
                            ]
                          },
                          null
                        ]
                      },
                      {
                        "$ne": [
                          {
                            "$ifNull": [
                              "$latter",
                              null
                            ]
                          },
                          null
                        ]
                      }
                    ]
                  },
                  "then": 5
                }
              ],
              "default": 6
            }
          }
        }
      }
    }
  },
  {
    "$group": {
      "_id": "$priority",
      "docIds": {
        "$push": "$_id"
      }
    }
  },
  {
    "$set": {
      "_id": {
        "executionId": "fill in your preferred id",
        "priority": "$_id",
        "prefixChar": "A",
        "latterword": "XYZ"
      }
    }
  },
  {
    "$merge": {
      "into": "precomputed"
    }
  }
])

Mongo Playground


Afterwards, when you are trying to fetch the result, you can start a $lookup from your precomputed collection. The precomputed collection can be indexed to boost the performance.

db.precomputed.aggregate([
  {
    "$match": {
      "_id.executionId": "fill in your preferred id"
    }
  },
  {
    "$sort": {
      "_id.priority": 1
    }
  },
  {
    "$unwind": "$docIds"
  },
  {
    "$limit": 12
  },
  {
    "$lookup": {
      "from": "collection",
      "localField": "docIds",
      "foreignField": "_id",
      "as": "rawDocsLookup"
    }
  },
  {
    "$unwind": "$rawDocsLookup"
  },
  {
    "$replaceRoot": {
      "newRoot": {
        "$mergeObjects": [
          "$rawDocsLookup",
          {
            priority: "$_id.priority"
          }
        ]
      }
    }
  }
])

Mongo Playground

Uncrown answered 21/8, 2024 at 18:34 Comment(1)
precomputing the scores +1 to thatIoneionesco

© 2022 - 2025 — McMap. All rights reserved.