Solr / Lucene idf score

I'm trying to get a better understanding of how Lucene scored my search so that I can make the necessary tweaks to my search configuration or the document content.

Below is part of the score breakdown.

product of:

    0.34472802 = queryWeight, product of:
        2.2 = boost
        7.880174 = idf(docFreq=48, maxDocs=47667)
        0.019884655 = queryNorm
      1.9700435 = fieldWeight in 14363, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        7.880174 = idf(docFreq=48, maxDocs=47667)
        0.25 = fieldNorm(doc=14363)
0.26806915 = (MATCH) max of:
  0.07832639 = (MATCH) weight(shortDescription:tires^1.1 in 14363) [DefaultSimilarity], result of:
    0.07832639 = score(doc=14363,freq=1.0 = termFreq=1.0
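
As a sanity check on the breakdown above, the queryWeight and fieldWeight lines can be reproduced by multiplying the factors listed under each of them. This is only a minimal sketch: the class and method names are made up for illustration and are not Lucene API, and the numbers are copied from the explain output.

```java
// Illustrative only: names are hypothetical, values come from the explain output.
class ExplainProductCheck {

    // queryWeight is listed as the product of boost, idf and queryNorm.
    static float queryWeight(float boost, float idf, float queryNorm) {
        return boost * idf * queryNorm;
    }

    // fieldWeight is listed as the product of tf, idf and fieldNorm.
    static float fieldWeight(float tf, float idf, float fieldNorm) {
        return tf * idf * fieldNorm;
    }

    public static void main(String[] args) {
        float idf = 7.880174f;
        // Should come out close to 0.34472802 and 1.9700435 respectively.
        System.out.println("queryWeight = " + queryWeight(2.2f, idf, 0.019884655f));
        System.out.println("fieldWeight = " + fieldWeight(1.0f, idf, 0.25f));
    }
}
```

Both products agree with the reported values to within float rounding, so the only unexplained number is the idf itself.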

I understand how the boost is calculated, as that is my configuration value.

But how was idf calculated (7.880174 = idf value)?

According to the Lucene documentation, the idf formula is: idf(t) = 1 + log(numDocs/(docFreq+1))

I checked the core admin console and found that my numDocs = maxDocs = 47667.

Using the formula from the Lucene documentation, I was not able to reproduce the expected 7.880174. Instead I get: idf = 3.988 = 1 + log(47667/(48+1)).

Is there something I am missing in my formula?

Leta answered 6/12, 2012 at 20:56

Looks like the Lucene site has a typo.

http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Similarity.html

states 1 + log(numDocs/(docFreq+1))

but it is actually 1 + ln(numDocs/(docFreq+1)), that is, the natural logarithm.

Leta answered 6/12, 2012 at 23:48

I think your log function uses base 10, while Lucene uses base e (the natural logarithm).

log(47667/(48+1), 10) = 2.9880217397306
log(47667/(48+1), e) = 6.8801743154459

The source code of Lucene's idf method is:

  public float idf(int docFreq, int numDocs) {
    return (float)(Math.log(numDocs/(double)(docFreq+1)) + 1.0);
  }

As you can see, idf uses Java's Math.log, and Math.log computes the natural logarithm (base e). See the Java Math API documentation for details.
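
To see the two readings of "log" side by side, here is a small self-contained sketch. The class and method names are hypothetical and not part of Lucene; the first method simply repeats the arithmetic of the idf method quoted above, and the second swaps in Math.log10 to show what a base-10 reading would give.

```java
// Illustrative sketch; class and method names are hypothetical, not Lucene API.
class IdfBaseCheck {

    // Same arithmetic as the quoted idf method: Math.log is the natural log.
    static float idfNaturalLog(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // What you get if "log" is read as base 10 (Math.log10) instead.
    static float idfBase10(int docFreq, int numDocs) {
        return (float) (Math.log10(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        int docFreq = 48;     // docFreq from the explain output
        int numDocs = 47667;  // maxDocs from the explain output
        System.out.println("natural log: " + idfNaturalLog(docFreq, numDocs));
        System.out.println("base 10:     " + idfBase10(docFreq, numDocs));
    }
}
```

With docFreq=48 and numDocs=47667 the natural-log version comes out at roughly 7.88, which matches the explain output, while the base-10 version gives roughly 3.99, the value computed in the question.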

Occam answered 7/12, 2012 at 0:48