Adding custom stopwords in R tm
G

6

17

I have a corpus in R built with the tm package, and I am using the removeWords function to remove stop words:

tm_map(abs, removeWords, stopwords("english")) 

Is there a way to add my own custom stop words to this list?

Grosbeak answered 26/8, 2013 at 14:22 Comment(0)
W
40

stopwords("english") simply returns a character vector of words, so you can combine your own words with it using c():

tm_map(abs, removeWords, c(stopwords("english"),"my","custom","words")) 
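If you do not want to retype the extra words for every call, one option is to build the combined vector once and reuse it. The following is a minimal sketch under that assumption; the names custom_words, all_stopwords and docs, and the sample sentence, are made up for illustration.

```r
library(tm)

# Hypothetical extra stop words; replace with your own.
custom_words <- c("percent", "cent", "million")

# Build the combined list once so it can be reused in every tm_map() call.
all_stopwords <- c(stopwords("english"), custom_words)

# Small made-up corpus to demonstrate the removal.
docs <- VCorpus(VectorSource("Profits rose ten percent to one million"))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, all_stopwords)
docs <- tm_map(docs, stripWhitespace)

docs[[1]]$content
```

Note that tm_map() returns a new corpus, so its result has to be assigned for the removal to take effect.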
Willena answered 26/8, 2013 at 14:33 Comment(1)
Instead of having to do this for each operation, is there a file or dict where I can add these extra stop words, such as percent, cent, million, etc.? - Tyne
C
4

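Save your custom stop words in a CSV file, one word per line (e.g. word.csv).

library(tm)
# Read the custom words; with header = FALSE the single column is named V1
custom_stopwords <- read.csv("word.csv", header = FALSE, stringsAsFactors = FALSE)
# Combine them with tm's default English list under a new name,
# so the stopwords() function is not shadowed by a variable
all_stopwords <- c(as.character(custom_stopwords$V1), stopwords("english"))

Then you can remove the combined list from your text.

# text is assumed to be a character vector of documents
text <- VCorpus(VectorSource(text))
text <- tm_map(text, content_transformer(tolower))
text <- tm_map(text, removeWords, all_stopwords)
text <- tm_map(text, stripWhitespace)

text[[1]]$content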
Cookout answered 15/5, 2017 at 14:5 Comment(1)
Please use 4-space indentation for blocks of code (instead of backticking them). - Ambur
R
2

You can create a character vector of your custom stop words and combine it with the default list like this:

tm_map(abs, removeWords, c(stopwords("english"), myStopWords)) 
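Here myStopWords is assumed to be a plain character vector, since removeWords expects a character vector of words. The sketch below shows two ways it might be created; the file is a temporary one written only for this demonstration, and the words and the corpus variable are illustrative. Note that read.csv returns a data frame, so the column has to be extracted before use.

```r
library(tm)

# Option 1: define the vector inline.
myStopWords <- c("percent", "cent", "million")

# Option 2: read one word per line from a file
# (a temporary file is written here only so the example is self-contained).
stop_file <- tempfile(fileext = ".csv")
writeLines(c("percent", "cent", "million"), stop_file)
myStopWords <- readLines(stop_file)

# read.csv() gives a data frame, so pull out the column as a character vector.
from_csv <- read.csv(stop_file, header = FALSE, stringsAsFactors = FALSE)$V1

# Made-up corpus to show the call from the answer in context.
corpus <- VCorpus(VectorSource("sales grew five percent"))
corpus <- tm_map(corpus, removeWords, c(stopwords("english"), myStopWords))
```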
Roster answered 4/11, 2016 at 16:47 Comment(1)
Is myStopWords expected to be a list or a character vector? Can you provide the command for creating myStopWords? Does myStopWords <- read.csv('mystop.csv') work? - Baram
F
2

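You could also use the textProcessor function from the stm package. It works quite well, and you can pass your own words as a character vector to customstopwords:

library(stm)
textProcessor(documents,
  removestopwords = TRUE, customstopwords = c("my", "custom", "words"))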
Fuchsin answered 12/7, 2018 at 18:3 Comment(1)
How do you modify the stop words used by the textProcessor function? - Elisa
J
1

You can add your own stop words to the default lists installed with tm. The package ships with many data files, including one stop word file per supported language. To change the English list, edit the english.dat file in the package's stopwords directory.
The easiest way to find that directory is to search your system for a folder named "stopwords" using your file browser; it contains english.dat along with the files for the other languages. Open english.dat in RStudio, which lets you edit it, and add your own words or drop existing ones as needed. The process is the same for any other language.
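As an alternative to searching by hand, R can report where the file lives. This is a small sketch that assumes tm is installed and stores english.dat in a stopwords folder inside the package directory. Edits made there will likely be lost when the package is reinstalled or updated.

```r
# Ask R where the tm package keeps its English stop word file.
stopword_file <- system.file("stopwords", "english.dat", package = "tm")

if (nzchar(stopword_file)) {
  print(stopword_file)
  # Peek at the first few entries (one word per line).
  print(head(readLines(stopword_file, encoding = "UTF-8")))
} else {
  message("english.dat not found; is tm installed?")
}
```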

Jenellejenesia answered 9/1, 2017 at 0:41 Comment(0)
D
1

I am using the stop_words data frame from the tidytext package instead of the tm library. I am posting my solution here in case anyone needs it.

library(tidytext)  # provides the stop_words data frame (columns: word, lexicon)

# Custom stop words to add, tagged with a "custom" lexicon label
word <- c("quick", "recovery")
lexicon <- rep("custom", times = length(word))

# Create a data frame from the two vectors above
mystopwords <- data.frame(word, lexicon, stringsAsFactors = FALSE)

# Append the custom rows to tidytext's stop_words data frame
stop_words <- dplyr::bind_rows(stop_words, mystopwords)
View(stop_words)
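To actually remove the words from text, the extended data frame can be used in an anti-join. This is a sketch that assumes stop_words is the data frame shipped with tidytext and that dplyr is installed; the sample sentence and the names custom, all_stop_words, docs and tokens are made up for illustration.

```r
library(dplyr)
library(tidytext)

# Custom stop words in the same two-column layout as tidytext's stop_words.
custom <- tibble(word = c("quick", "recovery"), lexicon = "custom")
all_stop_words <- bind_rows(stop_words, custom)

# Made-up document: split into one word per row, then drop every stop word.
docs <- tibble(id = 1, text = "We wish you a quick recovery after surgery")
tokens <- docs %>%
  unnest_tokens(word, text) %>%
  anti_join(all_stop_words, by = "word")

tokens
```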
Dupuy answered 11/3, 2021 at 12:44 Comment(0)
