Why is log used when calculating term frequency weight and IDF, inverse document frequency?
Asked Answered
The formula for IDF is log(N / df_t) instead of just N / df_t.

Where N = total number of documents in the collection, and df_t = document frequency of term t.

Log is said to be used because it “dampens” the effect of IDF. What does this mean?

Also, why do we use log frequency weighting for term frequency, as seen here:

[image: the log-frequency weight, w_(t,d) = 1 + log10(tf_(t,d)) if tf_(t,d) > 0, and 0 otherwise]

Leucotomy answered 21/11, 2014 at 18:33 Comment(1)
See mailman.uib.no/public/corpora/2018-June/thread.html – Confucianism

Debasis's answer is correct. I am not sure why he got downvoted.

Here is the intuition: if the term frequency for the word 'computer' is 10 in doc1 and 20 in doc2, we can say that doc2 is more relevant than doc1 for the word 'computer'.

However, if the term frequency of the same word, 'computer', is 1 million in doc1 and 2 million in doc2, there is not much difference in relevance anymore, because both documents already contain a very high count of the term 'computer'.

As in Debasis's answer, the log is added to dampen the importance of terms with very high frequencies; e.g. using log base 2, a count of 1 million is reduced to about 19.9!

We also add 1 to log(tf) because log(1) is zero; adding one lets us distinguish between tf = 0 (weight 0) and tf = 1 (weight 1).
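A quick sketch of this damping in Python, assuming the common textbook form 1 + log10(tf) (the log base is a convention; the answer above happens to use base 2):

```python
import math

def log_tf(tf):
    """Log-frequency weight: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0

# Counts that differ by 2x end up nearly equal after damping:
print(log_tf(10))         # 2.0
print(log_tf(20))         # ~2.3
print(log_tf(1_000_000))  # 7.0
print(log_tf(2_000_000))  # ~7.3
```

Note how the gap between 1 million and 2 million occurrences shrinks to the same ~0.3 gap as between 10 and 20, and how the `+ 1` keeps tf = 1 (weight 1) distinct from tf = 0 (weight 0).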

Hope this helps!

Hordein answered 30/10, 2015 at 6:21 Comment(3)
Great answer, but isn't the question about IDF and not TF? It seems like your reasoning should be applied to TF. – Jaala
Yes, the same idea also applies to the IDF term: the higher the IDF, the more unique the given word/token. Say the total number of docs is 100M and the number of docs with the given token is 10; then 100M/10 = 10M, so applying a log is helpful. – Hordein
There is no log in the TF formula, see: en.wikipedia.org/wiki/Tf%E2%80%93idf – Dozer

It is not necessarily the case that the more often a term occurs in a document, the more relevant the document is. The contribution of term frequency to document relevance is essentially a sub-linear function, hence the log, which approximates that sub-linear function.

The same is applicable to IDF as well. A linear IDF function may boost too much the scores of documents containing high-IDF terms (which could simply be rare terms caused by spelling mistakes); a sub-linear function performs much better.

Hospice answered 24/11, 2014 at 9:9 Comment(0)

I'll try to answer from a more practical angle. Let's take two words: "the" and "serendipity".

If our corpus consists of 1000 documents, the first word, "the", will occur in almost every document, whereas "serendipity" is a rare word and might occur in far fewer; for instance, let's say it occurs in only one document.

So, when calculating the IDF of both:

               N/df_t              log(N/df_t)
the          = 1000/1000 = 1       0
serendipity  = 1000/1    = 1000    ~6.9 (natural log)

Now, if our TF values are typically in the range 0-20, a raw IDF of 1000 would completely dominate the TF; taking log(IDF) instead puts it on a scale where it has an effect on the result comparable to that of the TF.
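The comparison above can be sketched in Python (using the natural log, which matches the ~6.9 value in the example; the corpus size N = 1000 is the hypothetical one from the answer):

```python
import math

N = 1000  # total documents in the hypothetical corpus

def idf_raw(df):
    """Plain N/df, without the log."""
    return N / df

def idf_log(df):
    """Log-damped IDF, natural log as in the example."""
    return math.log(N / df)

# "the" appears in all 1000 docs, "serendipity" in only 1:
print(idf_raw(1000), idf_log(1000))  # 1.0 0.0
print(idf_raw(1), idf_log(1))        # 1000.0 ~6.91
```

Without the log, "serendipity" scores 1000x higher than "the" and swamps any TF in the 0-20 range; with the log, the gap is a manageable ~6.9.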

Lectionary answered 15/12, 2020 at 10:50 Comment(0)

You can think of it as the information content of the word in the entire corpus, i.e. information content = -log(p) = -log(n_i/N) = log(N/n_i).
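A minimal sketch of that identity, assuming n_i = number of documents containing the term and N = total documents:

```python
import math

def information_content(n_i, N):
    """Self-information -log(p) of a term that occurs in n_i of N docs."""
    p = n_i / N
    return -math.log(p)  # equals log(N / n_i), i.e. the (natural-log) IDF

# A term appearing in 10 of 1000 docs carries log(100) ~ 4.6 nats:
print(information_content(10, 1000))
```

Rare terms (small p) carry more information, which is exactly why IDF boosts them.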

Estheresthesia answered 16/10, 2019 at 18:48 Comment(0)

In the context of IDF, let me take an example:

Let's say we have 1000 documents; a term t1 is present in only one document out of the thousand, and a term t2 is present in 2.

If we didn't take the log, then:

IDF of t1 = 1000, IDF of t2 = 500

Does that mean t1 is twice as important and rare? Obviously not. If we are talking about big data and millions of documents, then words present in 1, 2, 5 or 10 documents would be considered roughly equally important. That's why we take the log: to reduce this effect.
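A two-line check of this in Python (natural log; N = 1000 as in the example):

```python
import math

N = 1000  # total documents, as in the example

# Without the log, t1 looks twice as important as t2:
print(N / 1, N / 2)                      # 1000.0 500.0

# With the log, the two rare terms score almost the same:
print(math.log(N / 1), math.log(N / 2))  # ~6.91 and ~6.22
```

The 2x ratio of the raw values collapses to a fixed additive gap of log(2) ~ 0.69, so t1 and t2 end up with comparable weights, as the answer argues.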

Ventral answered 2/9, 2023 at 9:5 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.