Other answers make good points (I upvoted), but I'd like to add some further color.
One should not attempt to infer details about a document from the value of its score (at least not with any standard TF/IDF or BM25 based similarity classes). The only thing these scores tell you is which documents are likely to be more relevant than others, assuming the scoring model's assumptions hold.
These models generally assume that "rare" words are more important than common words ("gold" usually matters more than "made" or "of", since many things are made and the word "of" appears in almost every document, but fewer things are gold...), and that documents in which a higher proportion of the words match the query are more relevant than documents with proportionally fewer matches (i.e. 12 matches in a 150 word document are probably more relevant than 14 matches in a 50,000 word document).
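As a rough illustration of that second assumption, here's a minimal sketch; a plain ratio is only the intuition, since the actual length normalization in Lucene's TF/IDF and BM25 is more sophisticated:

```python
# Proportion-of-matches intuition only; Lucene/BM25 length normalization
# is more sophisticated than a plain ratio.
short_doc = 12 / 150     # 12 matching terms in a 150-word document    -> 0.08
long_doc = 14 / 50_000   # 14 matching terms in a 50,000-word document -> 0.00028
print(short_doc > long_doc)  # True: the short document looks far more relevant
```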
"rare" is estimated by looking at the documents in the index (the system can't know about anything it hasn't indexed). Therefore the score for a document changes every time any document is added to the index. Either
- The new document contains one of the terms in the query you care about, or
- The new document does not contain one of the terms in the query you care about.
In the first case, the fraction of documents containing that term goes up (+1 to both numerator and denominator, so if 1 out of 2 matched before, 2 out of 3 match now). In the second case the number of documents goes up and the fraction goes down (1 out of 2 becomes 1 out of 3). Thus in case #1 the score of every previously matching document goes down and in case #2 the score of every previously matching document goes up, because the score is proportional to the inverse document frequency (IDF), which grows as the fraction of documents containing the term shrinks (BM25's IDF is trickier, but behaves similarly).
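To make that arithmetic concrete, here is a minimal sketch using the textbook idf = log(N/df); Lucene and Solr use smoothed variants (and BM25 its own formula), but the direction of the change is the same:

```python
import math

def idf(num_docs: int, doc_freq: int) -> float:
    # Textbook inverse document frequency; Lucene/Solr use smoothed variants,
    # but the direction of change when documents are added is the same.
    return math.log(num_docs / doc_freq)

# Start: 2 documents in the index, 1 of them contains the term "gold".
print(idf(num_docs=2, doc_freq=1))  # ~0.69

# Case 1: the new document also contains "gold" -> 2 of 3 match,
# idf drops, so every previously matching document now scores lower.
print(idf(num_docs=3, doc_freq=2))  # ~0.41

# Case 2: the new document does not contain "gold" -> 1 of 3 match,
# idf rises, so every previously matching document now scores higher.
print(idf(num_docs=3, doc_freq=1))  # ~1.10
```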
Mostly it seems people ask this type of question after they've made the tactical error of displaying the document score in the results the user sees. The user, not being an information retrieval expert, has no idea what the number means. The user usually complains because they've made a guess about how it works and then found that their guess was wrong. Don't show the score to users, even if you've 'normalized' it. The score will only confuse them.
If you really need to ensure that you only get results where all of the terms match, you can set q.op=AND, but this runs a strong risk of users getting completely empty search results. Users are rarely happy with a blank search results page (there are exceptions, but they're rare), and users are probably not going to buy anything if they get no results, whereas they might buy the next best thing if you show it to them.
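For example, q.op can be sent as a plain request parameter to the select handler. Here's a minimal sketch using Python's requests against a hypothetical local core named "products" (the host, core name, and field list are assumptions, not something from your setup):

```python
import requests

# Hypothetical local Solr core; adjust the host, core name, and fields to your setup.
resp = requests.get(
    "http://localhost:8983/solr/products/select",
    params={
        "q": "iphone 6s 64GB gold",
        "q.op": "AND",          # require every term to match instead of the default OR
        "fl": "id,name,score",
    },
)
print(resp.json()["response"]["numFound"])
```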
You may still get things that look like false matches if you are stemming, using synonyms, or otherwise modifying tokens during analysis. "golden" and "gold" would likely both get stemmed to "gold", so with stemming your query of "iphone 6s 64GB gold" could also match a document with the text "golden opportunity to win a free case for galaxy note 9".
Scores are for sorting by relevancy. They are not good for anything else.
Finally, there IS a way to get at the information about which terms matched from the debug output, but forcing Solr to return that output is expensive and may lead to unacceptable query response times and a large increase in the size of the data transferred for query responses. This is the option of last resort because it is so costly. Very few use cases derive enough value from parsing this output to pay for the cost of producing it. Also, that output is for debugging and is somewhat more likely to change between Solr versions than the rest of the response (to reflect new features if nothing else), and that could make upgrades painful.
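If you do decide to pay that cost, the parameter is debugQuery=true, and the per-document explanations come back under the debug section of the response. A minimal sketch (same hypothetical "products" core as above):

```python
import requests

# debugQuery=true asks Solr to include scoring explanations; it is expensive,
# so keep it out of your normal production query path.
resp = requests.get(
    "http://localhost:8983/solr/products/select",
    params={"q": "iphone 6s 64GB gold", "fl": "id,score", "debugQuery": "true"},
)
body = resp.json()
# Per-document scoring explanations, keyed by the document's unique key.
for doc_id, explanation in body["debug"]["explain"].items():
    print(doc_id, explanation)
```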