Quanteda: how to remove my own list of words

Since there is no ready-made Polish stopword list in quanteda, I would like to use my own. I have it in a text file as a list separated by spaces. If need be, I can also prepare a list separated by new lines.

How can I remove the custom long list of stopwords from my corpus? How can I do that after stemming?

I have tried various approaches, such as converting the list to a character vector:

stopwordsPL <- as.character(readtext("polish.stopwords.txt",encoding = "UTF-8"))
stopwordsPL <- read.txt("polish.stopwords.txt", encoding = "UTF-8", stringsAsFactors = F)
stopwordsPL <- dictionary(stopwordsPL)

I have also tried to use such word vectors in calls like

myStemMat <-
  dfm(
    mycorpus,
    remove = as.vector(stopwordsPL),
    stem = FALSE,
    remove_punct = TRUE,
    ngrams=c(1,3)
  )

dfm_trim(myStemMat, sparsity = stopwordsPL)

or

myStemMat <- dfm_remove(myStemMat,features = as.data.frame(stopwordsPL))

Nothing works: my stopwords still show up in the corpus and in the analysis. What is the proper way/syntax to apply a custom stopword list?

Fleecy answered 26/7, 2017 at 12:51 Comment(2)
Could you provide example data? – Thirtytwo
Sure, here is everything: dropbox.com/s/vqasd32m8kmkfi5/text_data.zip?dl=0 It is only five texts and a file with Polish stop words; the rest is just test syntax for a simple dfm. – Fleecy

Assuming your polish.stopwords.txt has one stopword per line, you should be able to remove them from your corpus easily this way:

stopwordsPL <- readLines("polish.stopwords.txt", encoding = "UTF-8")

dfm(mycorpus,
    remove = stopwordsPL,
    stem = FALSE,
    remove_punct = TRUE,
    ngrams = c(1, 3))

The readtext solution does not work because it reads the entire file in as a single document. To get the individual words, you would need to tokenise that document and coerce the tokens to character. readLines() is probably easier.
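
For completeness, a sketch of how the readtext route could be made to work (assuming the stopword file is one long whitespace-separated line; readLines() is still the simpler choice):

library(readtext)
library(quanteda)

# read the whole file in as a single document, split it into word tokens,
# then coerce the tokens object to a plain character vector of stopwords
stopfile <- readtext("polish.stopwords.txt", encoding = "UTF-8")
stopwordsPL <- as.character(tokens(stopfile$text))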

There is also no need to create a dictionary from stopwordsPL, since the remove argument takes a character vector. And I am afraid there is no Polish stemmer implemented yet.

Currently (v0.9.9-65) the feature removal in dfm() does not remove stopwords that end up inside bigrams and trigrams. To work around this, form the tokens first:

# form the tokens, removing punctuation
mytoks <- tokens(mycorpus, remove_punct = TRUE)
# remove the Polish stopwords, leave pads
mytoks <- tokens_remove(mytoks, stopwordsPL, padding = TRUE)
## can't do this next one since no Polish stemmer in 
## SnowballC::getStemLanguages()
# mytoks <- tokens_wordstem(mytoks, language = "polish")
# form the ngrams
mytoks <- tokens_ngrams(mytoks, n = c(1, 3))
# construct the dfm
dfm(mytoks)
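
As a quick sanity check, you can confirm that none of the stopwords survive as unigram features (assuming the final dfm is saved in an object, here called mydfm):

# should return FALSE once the Polish stopwords have been removed
mydfm <- dfm(mytoks)
any(stopwordsPL %in% featnames(mydfm))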
Griffin answered 26/7, 2017 at 13:37 Comment(1)
Thanks so much, it works! I intend to use your answer to remove the least important ngrams after they are indicated by a random forest. – Fleecy
