How to calculate the score based on number of query terms in elasticsearch?

I want the queries to return a score that gets calculated like:

(occurrences of each query term in title + description) / number of query terms

For example:

EbSearch.add [ 
new_job( id: 1, title: "Java Programmierer", 
description: "Java Programmierer")
]

res = EbSearch.search("Java Programmierer").results.first.score.should == 4

At the moment it outputs 8, because the query is run for each term and the per-term scores are summed. I could just divide afterwards, but I don't have the analyzed query terms, so compounds could throw off the score.

The query is structured like below:

search = Tire.search index_name do
  query do 
    dis_max do 
       query { string query, fields: ['title^3', 'description.with_synonyms^0.5'], use_dis_max: false, default_operator: "OR" }  
       query { string query, fields: ['title^3', 'description.without_synonyms'], use_dis_max: false, default_operator: "OR"}
    end
  end
end

Any idea how I could solve this problem is greatly appreciated.

EDIT

I realized that I did not provide enough context.

Here are some other snippets I already worked out. I wrote a custom SimilarityProvider to disable idf and normalization. https://gist.github.com/outsmartin/6114175

The complete Tire code can be found here: https://gist.github.com/6114186. It is a little more complicated than the example, but it should be understandable.

Leveridge answered 23/7, 2013 at 16:23 Comment(7)
By compounds, do you mean that a search phrase like "elastic-search" might get tokenised into 2 tokens? Would you want to divide by 2 then? - Vichy
For example, but as I have a lot of German terms I have to split "Javaprogrammierer" into Java and Programmierer as well. Because the query gets executed with all terms, I want the score to stay between 0 and 4 in the example. - Leveridge
I am still a little confused. From your description above, I understood that a search for "Java Programmierer" should have a score of: (4: occurrences of each query term in title + description) / (2: number of query terms) = 2. But you say you want the score to be 4. - Vichy
Elasticsearch calculates the score for each query term, so it would be 8 for the occurrences, and 8 / 2 = 4. - Leveridge
Is it 8 because of boosting? I can't manage to count 8 occurrences. We could continue this conversation in chat. - Vichy
It is because of boosting, yes. title = 1*3 plus description = 1*1 gives 4; this applies to both Java and Programmierer, so it is 8. - Leveridge
Do you know the number of terms before you send the query? - Vichy

You can easily get a list of analyzed terms for your query using the analyze command. However, I have to mention that Elasticsearch scoring is much more complicated than it might seem when you run your tests on tiny indices. You can find the formula that Elasticsearch uses in the Lucene documentation, and you can use the explain command to see how this formula is applied to your results. I would also suggest testing and tuning your scoring algorithm on an index with a single shard, or using the dfs_query_then_fetch search type, which produces more precise results on small indices.
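
As a rough illustration of the analyze-then-divide idea, here is a minimal Ruby sketch. The helper names (`fetch_analysis`, `analyzed_term_count`, `normalized_score`) are hypothetical and not part of Tire or Elasticsearch. The JSON-body request shown for `_analyze` is an assumption that fits newer Elasticsearch releases; older releases take the analyzer and text as URL parameters, so check the documentation for your version. Counting distinct tokens is also just one possible choice, and may need adjusting if your analyzer emits synonyms.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Hypothetical helper: asks the _analyze endpoint how the query text is tokenized.
# Assumption: the server accepts a JSON body with "analyzer" and "text".
def fetch_analysis(base_url, index, analyzer, text)
  uri = URI.join(base_url, "/#{index}/_analyze")
  body = JSON.generate(analyzer: analyzer, text: text)
  response = Net::HTTP.post(uri, body, 'Content-Type' => 'application/json')
  JSON.parse(response.body)
end

# Counts the distinct analyzed terms in an _analyze response,
# which has the shape {"tokens" => [{"token" => "java", ...}, ...]}.
def analyzed_term_count(analysis)
  analysis.fetch('tokens', []).map { |t| t['token'] }.uniq.size
end

# Divides the raw score by the number of analyzed terms.
# Returns 0.0 when the analyzer produced no terms at all.
def normalized_score(raw_score, analysis)
  count = analyzed_term_count(analysis)
  count.zero? ? 0.0 : raw_score.to_f / count
end

# Canned response, so the example runs without a server:
sample = JSON.parse('{"tokens":[{"token":"java"},{"token":"programmierer"}]}')
puts normalized_score(8, sample) # prints 4.0 (raw score 8 over 2 terms)
```

In a real setup, `fetch_analysis` would be called with the same analyzer the query uses, so that a compound such as "Javaprogrammierer" is counted the way it is actually split at search time.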

Tengler answered 27/7, 2013 at 14:16 Comment(1)
I updated the question with some more information. The idea with the analyze command sounds promising; the only drawback is another request to the Elasticsearch server. - Leveridge
