Trying to get tf-idf weighting working in R

I am trying to do some very basic text analysis with the tm package and get some tf-idf scores. I'm running OS X (though I've tried this on Debian Squeeze with the same result). My working directory contains a couple of text files (the first containing the first three episodes of Ulysses, the second containing the second three episodes, if you must know).

R version: 2.15.1. sessionInfo() reports this about tm: [1] tm_0.5-8.3

Relevant bit of code:

library('tm')
corpus <- Corpus(DirSource('.'))
dtm <- DocumentTermMatrix(corpus,control=list(weight=weightTfIdf))

str(dtm)
List of 6
 $ i       : int [1:12456] 1 1 1 1 1 1 1 1 1 1 ...
 $ j       : int [1:12456] 2 10 12 17 20 24 29 30 32 34 ...
 $ v       : num [1:12456] 1 1 1 1 1 1 1 1 1 1 ...
 $ nrow    : int 2
 $ ncol    : int 10646
 $ dimnames:List of 2
  ..$ Docs : chr [1:2] "bloom.txt" "telemachiad.txt"
  ..$ Terms: chr [1:10646] "_--c'est" "_--et" "_--for" "_--goodbye," ...
 - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
 - attr(*, "Weighting")= chr [1:2] "term frequency" "tf"

You will note that the weighting still appears to be the default term frequency (tf) rather than the weighted tf-idf scores that I'd like.

Apologies if I'm missing something obvious, but based on the documentation I've read, this should work. The fault, no doubt, lies not in the stars...

Climatology answered 11/2, 2013 at 20:49 Comment(0)

If you look at the DocumentTermMatrix help page, and at the example, you will see that the control argument is specified this way:

data(crude)
dtm <- DocumentTermMatrix(crude,
           control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE),
                          stopwords = TRUE))

So the weighting is specified with the list element named weighting, not weight. You can specify it by passing a function name or a custom function, as in the example. The following works too:

data(crude)
dtm <- DocumentTermMatrix(crude, control = list(weighting = weightTfIdf))
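To see why the misspelled name fails without any error, here is a minimal base-R sketch. It assumes the option is looked up in the control list by its exact name and that a default is used when it is absent. The helper pick_weighting is hypothetical and is not part of tm; it only mimics that kind of lookup.

```r
# Hypothetical stand-in for a function that reads options from a control list.
# pick_weighting is NOT a tm function; it only mimics a
# "fall back to the default when the option is absent" lookup.
pick_weighting <- function(control = list()) {
  w <- control[["weighting"]]   # exact-name lookup; NULL if the element is absent
  if (is.null(w)) "term frequency (default)" else w
}

# Element named 'weight': the lookup finds nothing and the default is used.
misnamed <- pick_weighting(list(weight = "tf-idf"))

# Element named 'weighting': the requested value is picked up.
correct <- pick_weighting(list(weighting = "tf-idf"))

print(misnamed)
print(correct)
```

Whatever the real lookup looks like inside tm, running str(dtm) again after the fix and reading the weighting attribute is a quick way to confirm that tf-idf was applied.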
Chinkiang answered 11/2, 2013 at 21:0 Comment(2)
Yup. That did it. weighting, not weight. I could kick myself. Thanks VERY much! (Climatology)
Please note that weightTfIdf normalizes by default. (Radiogram)
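To make the effect of normalize concrete, here is a rough base-R sketch of tf-idf on a toy count matrix. It assumes the usual definition: term frequency (divided by document length when normalizing) times log2(number of documents / number of documents containing the term). The helper tfidf_sketch and the toy counts are invented for illustration; this is not tm's implementation and its numbers may differ in detail.

```r
# Toy document-term counts: rows are documents, columns are terms.
counts <- matrix(c(2, 1, 0,
                   1, 0, 3),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Docs  = c("a.txt", "b.txt"),
                                 Terms = c("stately", "plump", "buck")))

# tfidf_sketch is a hypothetical helper, not part of tm.
tfidf_sketch <- function(m, normalize = TRUE) {
  # Relative frequency within each document when normalizing, raw counts otherwise.
  tf <- if (normalize) m / rowSums(m) else m
  # Number of documents in which each term occurs at least once.
  df <- colSums(m > 0)
  # Inverse document frequency (log base 2 is an assumption of this sketch).
  idf <- log2(nrow(m) / df)
  # Multiply every column of tf by that term's idf.
  sweep(tf, 2, idf, `*`)
}

weighted_norm <- tfidf_sketch(counts, normalize = TRUE)
weighted_raw  <- tfidf_sketch(counts, normalize = FALSE)

print(weighted_norm)
print(weighted_raw)
```

Under this definition, a term that occurs in every document gets an idf of log2(1) = 0. With a corpus of only two files, every word shared by both will therefore score zero.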
