MongoDB Full-Text Search Score: "What does the score mean?"

I'm working on a MongoDB project for my school. I have a collection of sentences, and I run a normal text search to find the most similar sentence in the collection, based on the score.

I run this query:

db.sentences.find({$text: {$search: "any text"}}, {score: {$meta: "textScore"}}).sort({score:{$meta:"textScore"}})

Take a look at these results when I query sentences:

"that kicking a dog causes it pain"
----Matched With
"that kicking a dog causes it pain – is not very controversial."
----Give a Result of:
*score: 2.4*


"This sentence have nothing to do with any other"
----Matched With
"Who is the “He” in this sentence?"
----Give a result of:
*Score: 1.0* 

What is the score value, and what does it mean? What if I want to show only the results that have a similarity of 70% or above?

How can I interpret the score so I can display a similarity percentage? I'm using C# for this, but don't worry about the implementation; I don't mind a pseudo-code solution!

Lelandleler answered 27/3, 2017 at 8:33 Comment(4)
What does similarity of 70% mean? What kind of score do you want to use for measuring similarity?Gusty
I'm actually trying to make plagiarism software where you upload your document and each sentence is compared against a pool of sentences. So, when the highest-scoring sentence is 70% similar or more, there is a probability of plagiarism.Lelandleler
@NasriYatim did you manage to find out how?Swiercz
Hi Nasri, I'm also new to MongoDB. I need to search for the name "Raja Sekar" in the name field, which I have indexed. But my condition is that the search term should match 75 percent of similar records. Can you please help me with this?Burdett

When you use a MongoDB text index, it generates a score for every matching document. This score indicates how strongly your search string matches the document; the higher the score, the closer the resemblance to the searched text. The score is calculated as follows:

Step 1: Let the search text = S
Step 2: Break S into tokens (if you are not doing a phrase search), say T1, T2, ..., Tn. Apply stemming to each token
Step 3: For every search token, calculate the score per indexed field of the text index as follows:

score = (weight * data.freq * coeff * adjustment);

where:
weight = user-defined weight for the field; the default is 1 when no weight is specified
data.freq = how frequently the search token appears in the text
coeff = (0.5 * data.count / numTokens) + 0.5
data.count = number of search tokens that matched
numTokens = total number of tokens in the text
adjustment = 1 by default; if the search token exactly equals the document field, then adjustment = 1.1
Step 4: The final score of the document is the sum of the scores of all tokens per text-index field:
Total Score = score(T1) + score(T2) + ... + score(Tn)
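The per-token formula above can be sketched in Python. This is only a sketch of the answer's notation, not MongoDB's actual C++ implementation; the argument names mirror the definitions above.

```python
# Sketch of the per-token score from the steps above (the answer's notation,
# not MongoDB's actual implementation).

def token_score(occurrences, data_count, num_tokens, weight=1.0, exact_match=False):
    """Score of one search token against one indexed field.

    occurrences : how many times the (stemmed) token appears in the field
    data_count  : data.count, the number of matching tokens
    num_tokens  : total number of tokens in the field
    """
    # The i-th occurrence (0-indexed) contributes 1 / 2**i; occurrences are summed
    data_freq = sum(1 / 2 ** i for i in range(occurrences))
    coeff = 0.5 * data_count / num_tokens + 0.5
    # adjustment is 1.1 only when the search text exactly matches the raw field
    adjustment = 1.1 if exact_match else 1.0
    return weight * data_freq * coeff * adjustment

# Example: a token appearing twice in a four-token field
print(token_score(occurrences=2, data_count=1, num_tokens=4))  # 0.9375
```

Per Step 4, the total document score is then the sum of `token_score` over all search tokens.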

So, as we can see above, the score is influenced by the following factors:

  1. The number of terms matching the searched text: the more matches, the higher the score
  2. The number of tokens in the document field
  3. Whether the searched text exactly matches the document field or not

Following is the derivation for one of your document:

Search String = This sentence have nothing to do with any other
Document = Who is the “He” in this sentence?

Score Calculation:
Step 1: Tokenize the search string, apply stemming, and remove stop words.
    Token 1: "sentence"
    Token 2: "nothing"
Step 2: For every search token obtained in Step 1, do steps 3-11:
        
      Step 3: Take Sample Document and Remove Stop Words
            Input Document:  Who is the “He” in this sentence?
            Document after stop word removal: "sentence"
      Step 4: Apply Stemming 
        Document in Step 3: "sentence"
        After Stemming : "sentence"
      Step 5: Calculate data.count per search token 
              data.count(sentence)= 1
              data.count(nothing)= 1
      Step 6: Calculate total number of token in document
              numTokens = 1
      Step 7: Calculate coefficient per search token
              coeff = (0.5 * data.count / numTokens) + 0.5
              coeff(sentence) = 0.5*(1/1) + 0.5 = 1.0
              coeff(nothing) = 0.5*(1/1) + 0.5 = 1.0
      Step 8: Calculate adjustment per search token (adjustment is 1 by default; only if the search text exactly matches the raw document is adjustment = 1.1)
              adjustment(sentence) = 1
              adjustment(nothing) = 1
      Step 9: weight of field (1 is default weight)
              weight = 1
      Step 10: Calculate the frequency of each search token in the document (data.freq)
           For the ith occurrence (0-indexed), the contribution is 1/(2^i); all occurrences are summed.
            a. data.freq(sentence) = 1/(2^0) = 1
            b. data.freq(nothing) = 0
      Step 11: Calculate score per search token per field:
         score = (weight * data.freq * coeff * adjustment);
         score(sentence) = (1 * 1 * 1.0 * 1.0) = 1.0
         score(nothing) = (1 * 0 * 1.0 * 1.0) = 0
Step 12: Add individual score for every token of search string to get total score
Total score = score(sentence) + score(nothing) = 1.0 + 0.0 = 1.0 

In the same way, you can derive the score for the other pair.
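As a sketch, that other derivation (the score of 2.4) works out as follows. The stemmed token lists below are my assumption about what MongoDB's English stemmer and stop-word list produce for these sentences.

```python
# Deriving the first score (2.4) with the same per-token formula.
# The token lists in the comments are assumptions about MongoDB's
# English stemming and stop-word removal.

def token_score(occurrences, data_count, num_tokens, weight=1.0, adjustment=1.0):
    data_freq = sum(1 / 2 ** i for i in range(occurrences))
    coeff = 0.5 * data_count / num_tokens + 0.5
    return weight * data_freq * coeff * adjustment

# Search: "that kicking a dog causes it pain"
#   -> stop words removed, stemmed: ["kick", "dog", "caus", "pain"]
# Document: "that kicking a dog causes it pain – is not very controversial."
#   -> ["kick", "dog", "caus", "pain", "controversi"]  (numTokens = 5)
search_tokens = ["kick", "dog", "caus", "pain"]
total = sum(token_score(occurrences=1, data_count=1, num_tokens=5)
            for _ in search_tokens)
print(total)  # ≈ 2.4
```

Each matching token contributes 1 * 1 * 0.6 * 1 = 0.6, and four matching tokens give the reported score of 2.4.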

For a more detailed analysis, see: Mongo Scoring Algorithm Blog

Hurty answered 31/8, 2020 at 11:38 Comment(0)

Text search assigns a score to each document that contains the search term in the indexed fields. The score determines the relevance of a document to a given search query.

For each indexed field in the document, MongoDB multiplies the number of matches by the weight and sums the results. Using this sum, MongoDB then calculates the score for the document.

The default weight is 1 for the indexed fields.

https://docs.mongodb.com/manual/tutorial/control-results-of-text-search/
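To illustrate the effect of weights, here is a minimal sketch of the "multiply matches by weight and sum" step described above. The field names and weight values are hypothetical, standing in for options you would pass when creating the text index.

```python
# Sketch: how per-field weights scale the summed match score.
# Field names and weights here are hypothetical index options.

def document_score(field_match_sums, weights):
    """field_match_sums: unweighted per-field match sums; default weight is 1."""
    return sum(weights.get(field, 1) * s for field, s in field_match_sums.items())

# With weights {"title": 10, "body": 1}, a title match counts ten times as much:
print(document_score({"title": 1.0, "body": 1.0}, {"title": 10, "body": 1}))  # 11.0
```

A field that is not listed in the weights falls back to the default weight of 1, matching the behavior described above.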

Infect answered 27/3, 2017 at 9:11 Comment(1)
Instead of plagiarising, explaining it with examples would really help.Swiercz

You can normalize the score to the range 0 to 1 in subsequent stages of your aggregation pipeline.

For example:

pipeline = [
    {
        # Filter by user ids and run the text search in one $match stage
        "$match": {
            "$and": [
                {"userId": {"$in": user_ids}},
                {
                    "$text": {
                        "$search": keywords,
                        "$caseSensitive": False,
                        "$diacriticSensitive": False,
                    },
                },
            ]
        }
    },
    # Materialize the text score as a regular field
    {"$addFields": {"score": {"$meta": "textScore"}}},
    # Compute the maximum score across all matched documents
    {"$setWindowFields": {"output": {"maxScore": {"$max": "$score"}}}},
    # Normalize each score to the 0..1 range
    {"$addFields": {"normalizedScore": {"$divide": ["$score", "$maxScore"]}}},
    # Keep only documents at least 70% as relevant as the best match
    {"$match": {"normalizedScore": {"$gte": 0.7}}},
    {"$sort": {"normalizedScore": -1}},
]

I needed similar functionality. In the example above:

  1. Create an aggregation pipeline to search the collection and filter by ids
  2. Add a score field holding each document's similarity score for the search words
  3. Compute the maximum score over all search results
  4. Add a normalizedScore field storing the normalized value in the range 0 to 1
  5. Finally, use normalizedScore to filter and sort the results

This is based on the MongoDB documentation: Normalize the Score

Nippy answered 26/8, 2024 at 7:3 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.