Removing overly common words (occur in more than 80% of the documents) in R

Asked 18/9, 2014 at 5:55 Answered 29/3, 2015 at 23:35

I am working with the 'tm' package in to create a corpus. I have done most of the preprocessing steps. The remaining thing is to remove overly common words (terms that occur in more than 80% of the documents). Can anybody help me with this?

dsc <- Corpus(dd)
dsc <- tm_map(dsc, stripWhitespace)
dsc <- tm_map(dsc, removePunctuation)
dsc <- tm_map(dsc, removeNumbers)
dsc <- tm_map(dsc, removeWords, otherWords1)
dsc <- tm_map(dsc, removeWords, otherWords2)
dsc <- tm_map(dsc, removeWords, otherWords3)
dsc <- tm_map(dsc, removeWords, javaKeywords)
dsc <- tm_map(dsc, removeWords, stopwords("english"))
dsc = tm_map(dsc, stemDocument)
dtm<- DocumentTermMatrix(dsc, control = list(weighting = weightTf, 
                         stopwords = FALSE))

dtm = removeSparseTerms(dtm, 0.99) 
# ^-  Removes overly rare words (occur in less than 2% of the documents)

Assets answered 18/9, 2014 at 5:55 Comment(0)

What if you made a removeCommonTerms function

removeCommonTerms <- function (x, pct) 
{
    stopifnot(inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")), 
        is.numeric(pct), pct > 0, pct < 1)
    m <- if (inherits(x, "DocumentTermMatrix")) 
        t(x)
    else x
    t <- table(m$i) < m$ncol * (pct)
    termIndex <- as.numeric(names(t[t]))
    if (inherits(x, "DocumentTermMatrix")) 
        x[, termIndex]
    else x[termIndex, ]
}

Then if you wanted to remove terms that in are >=80% of the documents, you could do

data("crude")
dtm <- DocumentTermMatrix(crude)
dtm
# <<DocumentTermMatrix (documents: 20, terms: 1266)>>
# Non-/sparse entries: 2255/23065
# Sparsity           : 91%
# Maximal term length: 17
# Weighting          : term frequency (tf)

removeCommonTerms(dtm ,.8)
# <<DocumentTermMatrix (documents: 20, terms: 1259)>>
# Non-/sparse entries: 2129/23051
# Sparsity           : 92%
# Maximal term length: 17
# Weighting          : term frequency (tf)

Sedition answered 18/9, 2014 at 6:10 Comment(2)

this is probably an un-SO-like comment, but you are amazing! – Carleton 18/9, 2014 at 13:5

Any idea how this would be possible with the Quanteda package? Moved this here. – Ferrante 11/1, 2017 at 11:8

If you are going to use DocumentTermMatrix, then an alternative approach is to use the bounds$global control option. For example:

ndocs <- length(dcs)
# ignore overly sparse terms (appearing in less than 1% of the documents)
minDocFreq <- ndocs * 0.01
# ignore overly common terms (appearing in more than 80% of the documents)
maxDocFreq <- ndocs * 0.8
dtm<- DocumentTermMatrix(dsc, control = list(bounds = list(global = c(minDocFreq, maxDocFreq)))

Giveandtake answered 29/3, 2015 at 23:35 Comment(1)

simply brilliant!! :) – Morelock 26/3, 2017 at 2:52

Recommended topics

Hot tags