Stronger boosting by date in Solr
P

3

18

Boosting by date field in solr is defined as:

{!boost b=recip(ms(NOW,datefield),3.16e-11,1,1)}

I looked everywhere (for example, Solr Dismax Config for Boost Scoring and Solr boost for multivalued date field), and every source references the SolrRelevancyFAQ and uses this same definition. However, I found that it does not boost my results enough. How can I make this date boosting stronger?

User is searching for two keywords. Both items contain both keywords (in same order) in both title and description. Neither of the keywords is repeated.

And the Solr debug output is far too confusing for me to understand the problem.

Now, this is not a huge problem. 99% of queries work fine and produce the expected results, so it's not as if Solr is not working at all. I just found this one situation very confusing and don't know how to proceed.

Prima answered 25/2, 2014 at 14:47 Comment(2)
So basically you want to know how the boosting you are using works, to understand which of the values you need to change, in order to make current (closer to NOW) documents more relevant? - Lacunar
Yes, which values and how (positive/negative, large or small)... - Prima
E
6

User is searching for two keywords. Both items contain both keywords (in same order) in both title and description. Neither of the keywords is repeated.

Well, from your example, it is clear that your results have ended up in a tie. To make sense of the confusing debug output and devise a tie-breaker policy, it is important to understand DisMax.

With DisMax queries, each term of the user input is executed against several fields. If more than one of them hits (the term appears in different fields of the same document), the highest-scoring hit is used. But what happens to the other sub-queries that hit in that document for the term? That is what the tie parameter defines. DisMax calculates the score for a term query as:

score= [score of the top scoring subquery] + tie * (sum of other hitting subqueries)

Consequently, the tie parameter is a value between 0 and 1 that determines whether DisMax considers only the highest-scoring hit for a term (tie=0), all the hits for a term (tie=1), or something between those two extremes.
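As a rough illustration of the formula above, here is a minimal Python sketch. The function name `dismax_score` and the per-field scores are made up for the example; this is not Solr code, only the arithmetic the formula describes.

```python
def dismax_score(subquery_scores, tie=0.0):
    """Combine the per-field scores of one term as described above:
    top score + tie * (sum of the other hitting subqueries)."""
    hits = [s for s in subquery_scores if s > 0]
    if not hits:
        return 0.0
    top = max(hits)
    return top + tie * (sum(hits) - top)


# Hypothetical scores of one term in two fields (e.g. title, description).
scores = [0.8, 0.5]

for tie in (0.0, 0.1, 1.0):
    print(f"tie={tie}: score={dismax_score(scores, tie):.2f}")
```

With tie=0 only the best field counts, so two documents that match in the same fields can end up with identical scores; a small non-zero tie lets the other fields act as a tie-breaker.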

The boost parameter is very similar to the bf parameter, but instead of adding its result to the final score, it multiplies the score by it. It is only available in the Extended DisMax Query Parser or the Lucid Query Parser.
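For reference, a request combining these parameters might be assembled as in the sketch below. It is written in Python rather than PHP, and the host, core name (`mycore`) and field names (`title`, `description`) are assumptions for illustration; only `datefield` and the recip() expression come from the question.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; adjust host and core name to your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"


def build_query_url(user_query, tie=0.1):
    """Build an eDisMax request with a multiplicative date boost."""
    params = {
        "q": user_query,
        "defType": "edismax",          # boost= requires eDisMax
        "qf": "title description",     # assumed field names
        "tie": tie,                    # tie-breaker between fields
        "boost": "recip(ms(NOW,datefield),3.16e-11,1,1)",
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


print(build_query_url("two keywords"))
```

The same key/value pairs can be passed through any HTTP client or Solr client library; nothing here depends on Python specifically.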

There is an interesting article Comparing Boost Methods of SOLR which may be useful to you.

References for this answer:

Shishir

Elurd answered 6/3, 2014 at 7:44 Comment(1)
This looks like a likely problem, now I'll just need to try and find a way to implement this in PHP. Thanks. - Prima
A
48

recip(x, m, a, b) implements f(x) = a/(xm+b) with :

  • x : the document age in ms, defined as ms(NOW,<datefield>).

  • m : a constant that defines the time scale over which the boost is applied. It should be the inverse of what you consider an old document age (a reference_time) in milliseconds. For example, for a reference_time of 1 year (3.16e10 ms), use its inverse: 3.16e-11 (1/3.16e10, rounded).

  • a and b are constants (defined arbitrarily).

  • xm = 1 when the document is 1 reference_time old (multiplier = a/(1+b)).
    xm ≈ 0 when the document is new, resulting in a value close to a/b.

  • Using the same value for a and b ensures the multiplier doesn't exceed 1 with recent documents.

  • With a = b = 1, a 1 reference_time old document has a multiplier of about 1/2, a 2 reference_time old document has a multiplier of about 1/3, and so on.
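The behaviour described in the list above can be checked with a few lines of Python. This is only a sketch of the arithmetic, not Solr code; `MS_PER_YEAR` is a name introduced here for the 3.16e10 ms approximation used above.

```python
MS_PER_YEAR = 3.16e10  # approximate number of milliseconds in a year


def recip(x, m, a, b):
    """Same formula as Solr's recip(x, m, a, b): a / (m*x + b)."""
    return a / (m * x + b)


# Multiplier for documents of various ages with the default settings.
for years in (0, 1, 2, 5):
    age_ms = years * MS_PER_YEAR
    print(years, "year(s):", round(recip(age_ms, 3.16e-11, 1, 1), 3))
```

Because 3.16e-11 is a rounded inverse, the values come out close to, not exactly, 1, 1/2, 1/3 and 1/6.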

How to make date boosting stronger?

  • Increase m : choose a shorter reference_time, for example 6 months, which gives m = 6.33e-11. Compared to a 1-year reference, a document reaches the same multiplier at half the age.

  • Decrease a and b : this steepens the response curve of the function, so the multiplier drops off faster with age. This can be very aggressive, see this example (page 8).

  • Apply a boost to the boost function itself with the bf (Boost Functions) parameter (this is a DisMax parameter, so it requires the DisMax or eDisMax query parser), e.g.:

    bf=recip(ms(NOW,datefield),3.16e-11,1,1)^2.0
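To get a feel for how much the first two options change the multiplier, the following Python sketch compares them. The helper `multiplier` and its parameter names are made up for this example, and a = b = 0.1 is just an arbitrary smaller value.

```python
MS_PER_YEAR = 3.16e10  # approximate, as used above


def multiplier(age_years, reference_years=1.0, a=1.0, b=1.0):
    """recip(ms(NOW,datefield), m, a, b) with m = 1 / reference_time (ms)."""
    m = 1.0 / (reference_years * MS_PER_YEAR)
    age_ms = age_years * MS_PER_YEAR
    return a / (m * age_ms + b)


print("age  default  6-month-ref  a=b=0.1")
for age in (0.5, 1, 2):
    print(age,
          round(multiplier(age), 3),
          round(multiplier(age, reference_years=0.5), 3),
          round(multiplier(age, a=0.1, b=0.1), 3))
```

Both variants leave a brand new document at a multiplier of 1 and only change how quickly older documents are pushed down.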
    

It is important to note a few things :

  • bf is an additive boost and acts as a bonus added to the score of newer documents.

  • {!boost b} is a multiplicative boost and acts more as a penalty applied to the score of older documents.

  • A bf score (the "bonus" added to the global score) is calculated independently of the relevancy score (the global score), meaning that a result set with higher scores may not be impacted as much as a result set with lower scores. In contrast, multiplicative boosts affect scores the same way regardless of the result set's relevancy, which is why they are usually preferred.

  • Do not use recip() for dates more than one reference_time in the future (with b = 1), or it will yield negative values, because xm + b drops below zero.
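The difference between the two kinds of boost can be illustrated with made-up numbers. The Python sketch below is a simplification: it ignores query normalization and any other scoring detail, and the relevancy scores are invented for the example.

```python
def additive(relevancy, recency):
    """bf-style boost: the recency value is added to the relevancy score."""
    return relevancy + recency


def multiplicative(relevancy, recency):
    """{!boost b}-style boost: the relevancy score is multiplied by recency."""
    return relevancy * recency


NEW, OLD = 1.0, 0.5  # recip() values: brand new vs. one reference_time old

# (relevancy of the old doc, relevancy of the new doc) at two score levels.
for old_rel, new_rel in ((10.0, 9.0), (1.0, 0.9)):
    print("additive:      ", additive(old_rel, OLD), "vs", additive(new_rel, NEW))
    print("multiplicative:", multiplicative(old_rel, OLD), "vs", multiplicative(new_rel, NEW))
```

With high relevancy scores the additive bonus is too small to move the newer document ahead, while with low scores it is enough; the multiplicative boost ranks the newer document first in both cases.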

See also this very insightful post by Nolan Lawson on Comparing boost methods in Solr.

Aarau answered 6/3, 2014 at 2:5 Comment(4)
Yes, this is a very thorough explanation. I wish this was the accepted answer. - Plate
Great explanation, very useful. Small typo near xm = 1: the multiplier needs parentheses, i.e. a/(1+b) - Marietta
...However, I can't get the math to match: your example with 6 months seems off. Wouldn't it be something like m = 1/(0.5*3.16e10) = 6.33e-11? - Marietta
@Marietta You're right, when using 6 months as reference, m = 6.3e-11 rounded. I don't know where I got the 'e-8' from. Thank you for pointing that out! - Aarau
P
1

There is a well-presented example in the ReciprocalFloatFunction that gives a clear view of how the recip boost works. If you find that DisMax does not offer you enough control over the boosting, you will have to do some tinkering with BoostQParserPlugin.

A multiplier of 3.16e-11 changes the units from milliseconds to years (since there are about 3.16e10 milliseconds per year). Thus, a very recent date will yield a value close to 1/(0+1) or 1, a date a year in the past will get a multiplier of about 1/(1+1) or 1/2, and a date two years old will yield 1/(2+1) or 1/3.

Paradigm answered 4/3, 2014 at 8:33 Comment(2)
What do you mean by "have to do some tinkering"? - Prima
This one is a bit out of date but is still relevant: nolanlawson.com/2012/06/02/comparing-boost-methods-in-solr. It will give you a comparison of all available boost methods. What I meant by tinkering is that you would implement/extend the BoostQParserPlugin to produce a BoostedQuery yourself, or build a custom request handler to achieve the same. This may be overkill for your scenario; take a look at multiplicative boost with edismax. typo3-media.com/blog/solr-recip-boosting.html - here you can test your recip function. - Paradigm
