I am working with the 'tm' package in R to create a corpus. I have done most of the preprocessing steps. The remaining step is to remove overly common words (terms that occur in more than 80% of the documents). Can anybody help me with this?
dsc <- Corpus(dd)
dsc <- tm_map(dsc, stripWhitespace)
dsc <- tm_map(dsc, removePunctuation)
dsc <- tm_map(dsc, removeNumbers)
dsc <- tm_map(dsc, removeWords, otherWords1)
dsc <- tm_map(dsc, removeWords, otherWords2)
dsc <- tm_map(dsc, removeWords, otherWords3)
dsc <- tm_map(dsc, removeWords, javaKeywords)
dsc <- tm_map(dsc, removeWords, stopwords("english"))
dsc <- tm_map(dsc, stemDocument)
dtm <- DocumentTermMatrix(dsc, control = list(weighting = weightTf,
                                              stopwords = FALSE))
dtm <- removeSparseTerms(dtm, 0.99)
# ^- Removes overly rare words (sparsity above 0.99, i.e. terms that occur in about 1% of the documents or fewer)
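For reference, here is a rough sketch of the kind of thing I am after, not a definitive solution. As far as I can tell, 'tm' has no ready-made counterpart to removeSparseTerms for common terms, so this counts for each term the number of documents it appears in and keeps only the columns at or below the 80% cutoff. The toy corpus and the names `docs`, `toy`, `doc_freq`, `max_share`, `keep`, and `dtm_trimmed` are made up for illustration. It assumes the matrix stores only non-zero counts, which should hold for a DocumentTermMatrix built with weightTf.

```r
library(tm)

# Toy corpus standing in for `dsc` (hypothetical data, for illustration only)
docs <- c("alpha beta gamma",
          "alpha beta delta",
          "alpha gamma epsilon",
          "alpha beta zeta",
          "alpha eta theta")
toy <- VCorpus(VectorSource(docs))
dtm <- DocumentTermMatrix(toy, control = list(weighting = weightTf))

# A DocumentTermMatrix is a sparse triplet matrix: dtm$j holds the column
# (term) index of every stored non-zero cell, one cell per document/term pair.
# Counting the cells per column gives the document frequency of each term.
doc_freq <- tabulate(dtm$j, nbins = nTerms(dtm))

# Keep terms that occur in at most 80% of the documents
max_share <- 0.8
keep <- which(doc_freq <= max_share * nDocs(dtm))
dtm_trimmed <- dtm[, keep]

Terms(dtm_trimmed)
```

In this toy example "alpha" occurs in all five documents, so it is the only term that gets dropped. On the real data the same lines would run on `dtm` after the removeSparseTerms call.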