tidytext, quanteda, and tm returning different tf-idf scores

library(tm)
library(tidyverse)
library(tidytext)  # provides unnest_tokens(), bind_tf_idf(), cast_dtm(), cast_dfm()
library(quanteda)

df <- as.data.frame(cbind(doc = c("doc1", "doc2"),
                          text = c("the quick brown fox jumps over the lazy dog",
                                   "The quick brown foxy ox jumps over the lazy god")),
                    stringsAsFactors = FALSE)

# tidytext
df.count1 <- df %>% unnest_tokens(word, text) %>%
  count(doc, word) %>%
  bind_tf_idf(word, doc, n) %>%
  select(doc, word, tf_idf) %>%
  spread(word, tf_idf, fill = 0)

# tm
df.count2 <- df %>% unnest_tokens(word, text) %>%
  count(doc, word) %>%
  cast_dtm(document = doc, term = word, value = n, weighting = weightTfIdf) %>%
  as.matrix() %>% as.data.frame()

# quanteda
df.count3 <- df %>% unnest_tokens(word, text) %>%
  count(doc, word) %>%
  cast_dfm(document = doc, term = word, value = n) %>%
  dfm_tfidf() %>% as.data.frame()

> df.count1
# A tibble: 2 x 12
  doc   brown    dog    fox   foxy    god jumps  lazy  over     ox quick   the
  <chr> <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>
1 doc1      0 0.0770 0.0770 0      0          0     0     0 0          0     0
2 doc2      0 0      0      0.0693 0.0693     0     0     0 0.0693     0     0

> df.count2
     brown       dog       fox jumps lazy over quick the foxy god  ox
doc1     0 0.1111111 0.1111111     0    0    0     0   0  0.0 0.0 0.0
doc2     0 0.0000000 0.0000000     0    0    0     0   0  0.1 0.1 0.1

> df.count3
     brown     dog     fox jumps lazy over quick the    foxy     god      ox
doc1     0 0.30103 0.30103     0    0    0     0   0 0.00000 0.00000 0.00000
doc2     0 0.00000 0.00000     0    0    0     0   0 0.30103 0.30103 0.30103

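You have stumbled upon differences in how the three packages calculate tf-idf: they use different logarithm bases for the IDF, and quanteda also uses a different default for the term frequency.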

Standard definitions:

TF: Term Frequency: TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

IDF: Inverse Document Frequency: IDF(t) = log(Total number of documents / Number of documents with term t in it)

Tf-idf weight is the product of these quantities TF * IDF

Looks simple, but it isn't. Let's calculate the tf_idf for the word dog in doc1.

First TF for dog: That is 1 term / 9 terms in doc = 0.11111

1/9 = 0.1111111

Now IDF for dog: the log of (2 documents / 1 document containing the term). There are multiple possibilities for the logarithm, namely log (the natural log), log2, or log10!

log(2) = 0.6931472
log2(2) = 1
log10(2) = 0.30103

#tf_idf on log:
1/9 * log(2) = 0.07701635

#tf_idf on log2:
1/9 * log2(2) = 0.1111111

#tf_idf on log10:
1/9 * log10(2) = 0.03344778
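The hand calculation above can be reproduced in base R. This is only a sketch under the standard definitions given earlier: `manual_tf_idf` is a hypothetical helper written for this answer, not a function from tidytext, tm, or quanteda, and the token vectors are typed in by hand to mirror what `unnest_tokens()` produces for the two example documents.

```r
# Tokens as unnest_tokens() would produce them (lower-cased, one word each).
docs <- list(
  doc1 = c("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"),
  doc2 = c("the", "quick", "brown", "foxy", "ox", "jumps", "over", "the", "lazy", "god")
)

# Hypothetical helper (not part of tidytext, tm, or quanteda):
# proportional term frequency times log-scaled inverse document frequency.
manual_tf_idf <- function(term, doc_name, docs, base = exp(1)) {
  tokens <- docs[[doc_name]]
  tf <- sum(tokens == term) / length(tokens)   # share of tokens equal to `term`
  docs_with_term <- sum(vapply(docs, function(d) term %in% d, logical(1)))
  idf <- log(length(docs) / docs_with_term, base = base)
  tf * idf
}

dog_ln    <- manual_tf_idf("dog", "doc1", docs)             # natural log
dog_log2  <- manual_tf_idf("dog", "doc1", docs, base = 2)   # log base 2
dog_log10 <- manual_tf_idf("dog", "doc1", docs, base = 10)  # log base 10

print(round(c(ln = dog_ln, log2 = dog_log2, log10 = dog_log10), 7))
```

Only the `base` argument changes between the three calls, so any difference in the results comes from the choice of logarithm. The helper assumes the term occurs in at least one document; otherwise the IDF would be infinite.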

Now it gets interesting. tidytext gives you the weighting based on the natural log (0.0770 in your output). tm returns the tf-idf based on log2 (0.1111111). From quanteda I expected 0.03344778, because its default base is 10, but it returned 0.30103.

Looking into quanteda, the result is correct, but by default it uses the raw term count instead of the proportional term frequency, so you get 1 * log10(2) = 0.30103. To get the proportional version, try the following code:

df.count3 <- df %>% unnest_tokens(word, text) %>% 
  count(doc, word) %>% 
  cast_dfm(document = doc,term = word, value = n)


dfm_tfidf(df.count3, scheme_tf = "prop", scheme_df = "inverse")
Document-feature matrix of: 2 documents, 11 features (22.7% sparse).
2 x 11 sparse Matrix of class "dfm"
      features
docs   brown        fox        god jumps lazy over quick the      dog     foxy       ox
  doc1     0 0.03344778 0.03344778     0    0    0     0   0 0        0        0       
  doc2     0 0          0              0    0    0     0   0 0.030103 0.030103 0.030103

That looks better and this is based on log10.
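To see why only the term-frequency scheme separates 0.30103 from 0.03344778, here is a small base R sketch for the word dog in doc1. The variable names are made up for illustration, and the counts are taken from the example documents rather than computed by quanteda.

```r
# Values for "dog" in doc1, taken from the example documents.
n_dog    <- 1              # occurrences of "dog" in doc1
n_tokens <- 9              # total tokens in doc1
idf_10   <- log10(2 / 1)   # 2 documents, 1 of them contains "dog"

tfidf_count <- n_dog * idf_10                # raw count as term frequency
tfidf_prop  <- (n_dog / n_tokens) * idf_10   # proportional term frequency

print(c(count = tfidf_count, prop = tfidf_prop))
```

The IDF is identical in both lines; dividing the count by the document length is the whole difference.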

If you also change the base parameter, quanteda can reproduce the tidytext or tm results.

# same as tidytext: the natural log
dfm_tfidf(df.count3, scheme_tf = "prop", scheme_df = "inverse", base = exp(1))

# same as tm
dfm_tfidf(df.count3, scheme_tf = "prop", scheme_df = "inverse", base = 2)
