How does the removeSparseTerms in R work?

I am using the removeSparseTerms function in R, and it requires a threshold value as input. I have also read that the higher the value, the more terms are retained in the returned matrix.

How does this method work, and what is the logic behind it? I understand the concept of sparseness, but does this threshold indicate how many documents a term should be present in, or some other ratio, etc.?

Tali answered 27/2, 2015 at 10:55 Comment(3)
I think the basic concept is that most entries in a tdm are empty, meaning that most terms do not appear in most of the documents. Lots and lots of zeros in the matrix. Typically 90% or more are zeros in a large corpus. If you set the threshold value at, say, 95%, the tm package drops enough of the very infrequent terms -- the ones that drive up the sparseness percentage -- so that the resulting set of terms has a sparseness of at most 95%. What to keep in mind, however, is that unusual words may be very important in terms of what the content means.Bine
Thanks for asking this question. The documentation for removeSparseTerms is, itself, very sparse...Infirm
I treat sparsity argument as "keeping rate/retaining rate"Trimetrogon

In the sense of the sparse argument to removeSparseTerms(), sparsity refers to the proportion of documents in which a term does not appear; a term is removed when that proportion meets or exceeds the threshold given by sparse. As the help page for the command states (although not very clearly), this means that values of sparse closer to 1.0 remove fewer terms, since only the very sparsest terms cross the threshold. (Note that sparse cannot take the values 0 or 1.0, only values strictly in between.)

For example, if you set sparse = 0.99 as the argument to removeSparseTerms(), then this will remove only terms that are more sparse than 0.99. The exact interpretation of sparse = 0.99 is that you retain only those terms $j$ for which $df_j > N * (1 - 0.99)$, where $df_j$ is the number of documents containing term $j$ and $N$ is the total number of documents -- in this case probably all terms will be retained (see the example below).

Near the other extreme, if sparse = .01, then only terms that appear in (nearly) every document will be retained. (Of course this depends on the number of terms and the number of documents, and in natural language, common words like "the" are likely to occur in every document and hence never be "sparse".)

Here are two examples with the sparsity threshold set to 0.99: in the first, the second term occurs in less than 0.01 of the documents; in the second, it occurs in just over 0.01 of the documents.

> # second term occurs in just 1 of 101 documents
> myTdm1 <- as.DocumentTermMatrix(slam::as.simple_triplet_matrix(matrix(c(rep(1, 101), rep(1,1), rep(0, 100)), ncol=2)), 
+                                weighting = weightTf)
> removeSparseTerms(myTdm1, .99)
<<DocumentTermMatrix (documents: 101, terms: 1)>>
Non-/sparse entries: 101/0
Sparsity           : 0%
Maximal term length: 2
Weighting          : term frequency (tf)
> 
> # second term occurs in 2 of 101 documents
> myTdm2 <- as.DocumentTermMatrix(slam::as.simple_triplet_matrix(matrix(c(rep(1, 101), rep(1,2), rep(0, 99)), ncol=2)), 
+                                weighting = weightTf)
> removeSparseTerms(myTdm2, .99)
<<DocumentTermMatrix (documents: 101, terms: 2)>>
Non-/sparse entries: 103/99
Sparsity           : 49%
Maximal term length: 2
Weighting          : term frequency (tf)
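
To make the cutoff explicit, here is a hand check of the rule df_j > N * (1 - sparse) for the two matrices above (a base R sketch reusing the myTdm1 and myTdm2 objects already built; the intermediate variable names are just for illustration):

# Sketch: apply the retention rule df_j > N * (1 - sparse) by hand
# to the two toy matrices built above (myTdm1 and myTdm2).
m1 <- as.matrix(myTdm1)
m2 <- as.matrix(myTdm2)

sparse <- 0.99
N      <- nrow(m1)           # 101 documents
cutoff <- N * (1 - sparse)   # 1.01

colSums(m1 > 0) > cutoff     # document frequencies 101 and 1 -> TRUE FALSE: term 2 dropped
colSums(m2 > 0) > cutoff     # document frequencies 101 and 2 -> TRUE TRUE:  both kept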

Here are a few additional examples with actual text and terms:

> myText <- c("the quick brown furry fox jumped over a second furry brown fox",
              "the sparse brown furry matrix",
              "the quick matrix")

> require(tm)
> myVCorpus <- VCorpus(VectorSource(myText))
> myTdm <- DocumentTermMatrix(myVCorpus)
> as.matrix(myTdm)
    Terms
Docs brown fox furry jumped matrix over quick second sparse the
   1     2   2     2      1      0    1     1      1      0   1
   2     1   0     1      0      1    0     0      0      1   1
   3     0   0     0      0      1    0     1      0      0   1
> as.matrix(removeSparseTerms(myTdm, .01))
    Terms
Docs the
   1   1
   2   1
   3   1
> as.matrix(removeSparseTerms(myTdm, .99))
    Terms
Docs brown fox furry jumped matrix over quick second sparse the
   1     2   2     2      1      0    1     1      1      0   1
   2     1   0     1      0      1    0     0      0      1   1
   3     0   0     0      0      1    0     1      0      0   1
> as.matrix(removeSparseTerms(myTdm, .5))
    Terms
Docs brown furry matrix quick the
   1     2     2      0     1   1
   2     1     1      1     0   1
   3     0     0      1     1   1

In the last example, with sparse = 0.5, only terms occurring in at least two of the three documents (two-thirds) were retained.
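
The same retention rule can be checked by hand for the sparse = 0.5 case (a sketch reusing the myTdm object built above):

# Sketch: reproduce the sparse = 0.5 result from the document frequencies.
mat    <- as.matrix(myTdm)
df     <- colSums(mat > 0)      # number of documents containing each term
N      <- nrow(mat)             # 3 documents
sparse <- 0.5

keep <- df > N * (1 - sparse)   # i.e. df > 1.5, so a term must occur in at
names(keep)[keep]               # least 2 of the 3 documents:
                                # "brown" "furry" "matrix" "quick" "the"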

An alternative approach for trimming terms from document-term matrices based on document frequency is the text analysis package quanteda. The equivalent functionality there is expressed not in terms of sparsity but directly in terms of the document frequency of terms (the df in tf-idf).

> require(quanteda)
> myDfm <- dfm(myText, verbose = FALSE)
> docfreq(myDfm)
     a  brown    fox  furry jumped matrix   over  quick second sparse    the 
     1      2      1      2      1      2      1      2      1      1      3 
> dfm_trim(myDfm, minDoc = 2)
Features occurring in fewer than 2 documents: 6 
Document-feature matrix of: 3 documents, 5 features.
3 x 5 sparse Matrix of class "dfmSparse"
       features
docs    brown furry the matrix quick
  text1     2     2   1      0     1
  text2     1     1   1      1     0
  text3     0     0   1      1     1

This usage seems much more straightforward to me.
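
Note that the quanteda interface has changed since this answer was written; in more recent versions the same idea looks roughly like this (a sketch; argument names may differ slightly across releases, so check ?dfm_trim):

# Sketch for newer quanteda versions: build the dfm from tokens and
# trim by document frequency.
library(quanteda)

myDfm <- dfm(tokens(myText))
docfreq(myDfm)                    # document frequency of each feature
dfm_trim(myDfm, min_docfreq = 2)  # keep features occurring in at least 2 documents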

Straightforward answered 20/11, 2015 at 15:44 Comment(3)
Amazing explanation, thanks. This should make its way into the R documentation !Tali
Indeed a great explanation; maybe you should really contact the authors and add some of this to the R documentation or the website at least.Marlette
Also, I think the function name is correct after my edit, but I guess you need to look into the documentation of ?dfm_trim for proper argument usage.Marlette

In the function removeSparseTerms(), the argument sparse = x means:
"remove all terms whose sparsity is greater than the threshold (x)".
e.g., removeSparseTerms(my_dtm, sparse = 0.90) means: remove all terms in the corpus whose sparsity is greater than 90%.

For example, a term that appears in, say, just 4 documents in a corpus of, say, 1000 documents will have a relative document frequency of 0.004 = 4/1000.

This term's sparsity will be (1000-4)/1000 = 1- 0.004 = 0.996 = 99.6%.
Therefore, if the sparsity threshold is set to sparse = 0.90, this term will be removed, as its sparsity (0.996) is greater than the upper-bound sparsity (0.90).
However, if the sparsity threshold is set to sparse = 0.999, this term will not be removed, as its sparsity (0.996) is lower than the upper-bound sparsity (0.999).
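
The same arithmetic in code (a toy sketch of the check described above, not the tm internals):

# Toy sketch of the sparsity check described above.
n_docs  <- 1000   # documents in the corpus
df_term <- 4      # documents in which the term appears

sparsity <- (n_docs - df_term) / n_docs   # 0.996

sparsity > 0.90    # TRUE  -> the term is removed when sparse = 0.90
sparsity > 0.999   # FALSE -> the term is kept    when sparse = 0.999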

Meaningful answered 23/2, 2018 at 16:25 Comment(0)

Simply put, it works like a document-frequency cutoff. If you set the value close to 0, only the terms that appear in (nearly) all of the documents are returned, whereas if you set it close to 1, (nearly) every term is returned. If I choose 0.5, it keeps only the terms that appear in at least 50% of the documents. After all the usual pre-processing, this amounts to keeping a term when

1 - (number of documents containing the term / total number of documents) < sparse

Height answered 19/4, 2015 at 14:45 Comment(1)
0 is not a valid value for sparse. You must be slightly above it and below 1.0.Infirm
  • In the context of a document-term matrix: if you set it to 0.6, you remove the words in the dtm that have 60% or more of their cells empty. If a cell is empty, it means that the word did not appear in that document. If a word has 60% of its cells empty, it appears in only 40% of the documents.
  • A word with a sparsity of 0.9 has 90% of its cells empty, so it appears in only 10% of the documents.
  • A word with a sparsity of 0.1 has 10% of its cells empty, so it appears in 90% of the documents.
  • When you set the sparsity to 0.9 in the removeSparseTerms function, words with a sparsity of 0.9 or higher are removed. This means that words that appear in only 1-10% of the documents are removed; the words that remain are those that appear in 11-100% of the documents.
  • When you set the sparsity to 0.1 in the removeSparseTerms function, words with a sparsity of 0.1 or higher are removed. This means that words that appear in only 1-90% of the documents are removed; the words that remain are those that appear in 91-100% of the documents.
  • It makes sense that as this sparse parameter decreases, the number of terms returned decreases, as you require words that appear in a larger share of the documents. Most of the time there are fewer words that appear in 91-100% of the documents than words that appear in 11-100% of the documents. (A quick way to check this behaviour is sketched after this list.)
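
A self-contained sketch using tm (the three-document corpus and the two thresholds here are made up for illustration):

# Self-contained sketch: compare two sparse thresholds on a tiny corpus.
library(tm)

docs <- VCorpus(VectorSource(c("apple banana cherry",
                               "apple banana",
                               "apple date")))
dtm <- DocumentTermMatrix(docs)

# "apple" occurs in 3/3 documents (sparsity 0), "banana" in 2/3 (~0.33),
# "cherry" and "date" in 1/3 (~0.67).
inspect(removeSparseTerms(dtm, 0.9))   # keeps every term (all sparsities are below 0.9)
inspect(removeSparseTerms(dtm, 0.4))   # keeps only "apple" and "banana"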
Tarantass answered 16/6 at 1:15 Comment(0)
