How to remove stopwords efficiently from a list of ngram tokens in R

Here's an appeal for a better way to do something that I can already do inefficiently: filter a series of n-gram tokens using "stop words" so that the occurrence of any stop word term in an n-gram triggers removal.

I'd very much like to have one solution that works for both unigrams and n-grams, although it would be ok to have two versions, one with a "fixed" flag and one with a "regex" flag. I'm putting the two aspects of the question together since someone may have a solution that tries a different approach that addresses both fixed and regular expression stopword patterns.

Formats:

  • tokens are a list of character vectors, which may be unigrams, or n-grams concatenated by a _ (underscore) character.

  • stopwords are a character vector. Right now I am content to let this be a fixed string, but it would be a nice bonus to be able to implement this using regular expression formatted stopwords too.

Desired output: a list of character vectors matching the input tokens, but with any token removed that matches a stop word. (This means a unigram match, or a match to one of the terms that the n-gram comprises.)

Examples, test data, working code, and benchmarks to build on:

tokens1 <- list(text1 = c("this", "is", "a", "test", "text", "with", "a", "few", "words"), 
                text2 = c("some", "more", "words", "in", "this", "test", "text"))
tokens2 <- list(text1 = c("this_is", "is_a", "a_test", "test_text", "text_with", "with_a", "a_few", "few_words"), 
                text2 = c("some_more", "more_words", "words_in", "in_this", "this_text", "text_text"))
tokens3 <- list(text1 = c("this_is_a", "is_a_test", "a_test_text", "test_text_with", "text_with_a", "with_a_few", "a_few_words"),
                text2 = c("some_more_words", "more_words_in", "words_in_this", "in_this_text", "this_text_text"))
stopwords <- c("is", "a", "in", "this")

# remove any single token that matches a stopword
# (caution: if nothing matches, x[-integer(0)] returns an empty vector,
#  i.e. everything is dropped; that never happens in these examples)
removeTokensOP1 <- function(w, stopwords) {
    lapply(w, function(x) x[-which(x %in% stopwords)])
}

# remove any n-gram in which any single component word matches a stopword
removeTokensOP2 <- function(w, stopwords) {
    matchPattern <- paste0("(^|_)", paste(stopwords, collapse = "(_|$)|(^|_)"), "(_|$)")
    lapply(w, function(x) x[-grep(matchPattern, x)])
}

removeTokensOP1(tokens1, stopwords)
## $text1
## [1] "test"  "text"  "with"  "few"   "words"
## 
## $text2
## [1] "some"  "more"  "words" "test"  "text" 

removeTokensOP2(tokens1, stopwords)
## $text1
## [1] "test"  "text"  "with"  "few"   "words"
## 
## $text2
## [1] "some"  "more"  "words" "test"  "text" 

removeTokensOP2(tokens2, stopwords)
## $text1
## [1] "test_text" "text_with" "few_words"
## 
## $text2
## [1] "some_more"  "more_words" "text_text" 

removeTokensOP2(tokens3, stopwords)
## $text1
## [1] "test_text_with"
## 
## $text2
## [1] "some_more_words"

# performance benchmarks for answers to build on
require(microbenchmark)
microbenchmark(OP1_1 = removeTokensOP1(tokens1, stopwords),
               OP2_1 = removeTokensOP2(tokens1, stopwords),
               OP2_2 = removeTokensOP2(tokens2, stopwords),
               OP2_3 = removeTokensOP2(tokens3, stopwords),
               unit = "relative")
## Unit: relative
## expr      min       lq     mean   median       uq      max neval
## OP1_1 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000   100
## OP2_1 5.119066 3.812845 3.438076 3.714492 3.547187 2.838351   100
## OP2_2 5.230429 3.903135 3.509935 3.790143 3.631305 2.510629   100
## OP2_3 5.204924 3.884746 3.578178 3.753979 3.553729 8.240244   100
Rakes answered 12/10, 2015 at 0:9 Comment(10)
Is the stopword removal in tm or qdap not enough? Though they work the other way around: first remove the stopwords, then create the n-grams.Viosterol
No, that's easy enough; I'm trying to figure out an efficient way to remove stopword-containing ngrams after construction.Rakes
Have you checked out Tyler Rinker's new package termco on GitHub? That looks promising. Haven't had time to check it out yet.Viosterol
basically a vectorized version of grepl for long vectors written in c. yes I was hoping someone would write that, too :} @RcoreStratagem
stringi comes close to that but not vectorized in the way needed here. I didn't use stringi in the examples/base code for this reason (it was not faster for this task in my tests, although it has many other attractive properties). But maybe someone will prove me wrong!Rakes
@Stratagem I believe grepl is already vectorized and written in c. stringi::stri_detect_regex and stringi::stri_detect_fixed are both faster and worth checking out.Vasiliu
@KenBenoit Is it really not possible in this case for you to remove the stopwords after tokenization, but before ngram construction?Vasiliu
@Vasiliu i can easily make a vectorized version of base r grepl. what I really mean (and as does this question) is that each pattern should be matched individually for each string of text so that rather than comparing three patterns and three strings 1-1, you are performing 3*3 operationsStratagem
Yes of course, that is a fast (and relatively easy) way, but the idea here is to develop a method for removing ngrams (phrases, collocations, etc) downstream after tokenisation has occurred.Rakes
@Vasiliu and of course grepl is sufficiently fast (I find stringi to be slow x <- rep('a', 1e7); system.time(grepl('a', x, fixed = TRUE)); system.time(stri_detect_fixed('a', x))) but the real heavy lifting is going through all the combinations which seems exponentially slower when you have a lot of patterns to match (adding targets is almost trivial)Stratagem

This isn't really an answer, more of a comment in reply to rawr's comment about going through all combinations of stopwords. With a longer stopwords list, using something like %in% doesn't seem to suffer from that dimensionality issue.

library(purrr)
removetokenstst <- function(tokens, stopwords)
  map2(tokens,
       lapply(tokens, function(x) {
         unlist(lapply(strsplit(x, "_"), function(y) {
           any(y %in% stopwords)
         }))
       }),
       ~ .x[!.y])
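
A quick check that the output matches removeTokensOP2 on the question's tokens3:

removetokenstst(tokens3, stopwords)
## $text1
## [1] "test_text_with"
## 
## $text2
## [1] "some_more_words"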

require(microbenchmark)
microbenchmark(OP1_1 = removeTokensOP1(tokens1, morestopwords),
               OP2_1 = removeTokensOP2(tokens1, morestopwords),
               OP2_2 = removeTokensOP2(tokens2, morestopwords),
               OP2_3 = removeTokensOP2(tokens3, morestopwords),
               Ak_3 = removetokenstst(tokens3, stopwords),
               Ak_3msw = removetokenstst(tokens3, morestopwords),
               unit = "relative")

Unit: relative
    expr       min        lq       mean    median        uq      max neval
   OP1_1   1.00000   1.00000   1.000000  1.000000  1.000000  1.00000   100
   OP2_1 278.48260 176.22273  96.462854 79.787932 76.904987 38.31767   100
   OP2_2 280.90242 181.22013  98.545148 81.407928 77.637006 64.94842   100
   OP2_3 279.43728 183.11366 114.879904 81.404236 82.614739 72.04741   100
    Ak_3  15.74301  14.83731   9.340444  7.902213  8.164234 11.27133   100
 Ak_3msw  18.57697  14.45574  12.936594  8.513725  8.997922 24.03969   100

Stopwords

morestopwords = c("a", "about", "above", "after", "again", "against", "all", 
"am", "an", "and", "any", "are", "arent", "as", "at", "be", "because", 
"been", "before", "being", "below", "between", "both", "but", 
"by", "cant", "cannot", "could", "couldnt", "did", "didnt", "do", 
"does", "doesnt", "doing", "dont", "down", "during", "each", 
"few", "for", "from", "further", "had", "hadnt", "has", "hasnt", 
"have", "havent", "having", "he", "hed", "hell", "hes", "her", 
"here", "heres", "hers", "herself", "him", "himself", "his", 
"how", "hows", "i", "id", "ill", "im", "ive", "if", "in", "into", 
"is", "isnt", "it", "its", "its", "itself", "lets", "me", "more", 
"most", "mustnt", "my", "myself", "no", "nor", "not", "of", "off", 
"on", "once", "only", "or", "other", "ought", "our", "ours", 
"ourselves", "out", "over", "own", "same", "shant", "she", "shed", 
"shell", "shes", "should", "shouldnt", "so", "some", "such", 
"than", "that", "thats", "the", "their", "theirs", "them", "themselves", 
"then", "there", "theres", "these", "they", "theyd", "theyll", 
"theyre", "theyve", "this", "those", "through", "to", "too", 
"under", "until", "up", "very", "was", "wasnt", "we", "wed", 
"well", "were", "weve", "were", "werent", "what", "whats", "when", 
"whens", "where", "wheres", "which", "while", "who", "whos", 
"whom", "why", "whys", "with", "wont", "would", "wouldnt", "you", 
"youd", "youll", "youre", "youve", "your", "yours", "yourself", 
"yourselves", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", 
"k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", 
"x", "y", "z")
Prendergast answered 21/10, 2015 at 14:7 Comment(1)
right, but this isn't doing exactly the same thing, since %in% only matches against the table, i.e. the length of stopwords (or whatever you get when you split the strings), whereas grepl goes character by character. So for stopwords <- c("is", "a", "in", "this"), %in% has four things to do while grepl has many more, depending on the target vector and the length of those stringsStratagem

We can improve on the lapply approach with the parallel package if your list has many elements.

Create a list with many elements

tokens2 <- list(text1 = c("this_is", "is_a", "a_test", "test_text", "text_with", "with_a", "a_few", "few_words"), 
                text2 = c("some_more", "more_words", "words_in", "in_this", "this_text", "text_text"))
tokens2 <- lapply(1:500, function(x) sample(tokens2, 1)[[1]])

We do this because the parallel package has a lot of setup overhead, so simply increasing the number of microbenchmark iterations would keep incurring that cost on a small list. Increasing the size of the list instead shows the true improvement.

library(parallel)
library(microbenchmark)

#Setup: one cluster worker per detected core
cl <- makeCluster(detectCores())

#Two functions:

#original
removeTokensOP2 <- function(w, stopwords) { 
  matchPattern <- paste0("(^|_)", paste(stopwords, collapse = "(_|$)|(^|_)"), "(_|$)")
  lapply(w, function(x) x[-grep(matchPattern, x)])
}

#new: works on a single character vector, since parLapply supplies each list element to a worker
removeTokensOPP <- function(w, stopwords) {
  matchPattern <- paste0("(^|_)", paste(stopwords, collapse = "(_|$)|(^|_)"), "(_|$)")
  return(w[-grep(matchPattern, w)])
}

#compare

microbenchmark(
  OP2_P = parLapply(cl, tokens2, removeTokensOPP, stopwords),
  OP2_2 = removeTokensOP2(tokens2, stopwords),
  unit = 'relative'
)

Unit: relative
  expr      min       lq     mean   median       uq      max neval
 OP2_P 1.000000 1.000000 1.000000 1.000000 1.000000  1.00000   100
 OP2_2 1.730565 1.653872 1.678781 1.562258 1.471347 10.11306   100

As the number of elements in your list increases, the relative performance of the parallel version will improve.
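
When you are finished, shut the worker processes down so they are not left running (cleanup for the cluster created above):

stopCluster(cl)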

Neutral answered 19/10, 2015 at 15:26 Comment(0)

You might consider simplifying your regular expression; the repeated (^|_) and (_|$) groups around every stopword add to the overhead.

remove_short <- function(x, stopwords) {
  stopwords_regexp <- paste0('(^|_)(', paste(stopwords, collapse = '|'), ')(_|$)')
  lapply(x, function(x) x[!grepl(stopwords_regexp, x)])
}
require(microbenchmark)
microbenchmark(OP1_1 = removeTokensOP1(tokens1, stopwords),
               OP2_1 = removeTokensOP2(tokens2, stopwords),
               OP2_2 = remove_short(tokens2, stopwords),
               unit = "relative")
Unit: relative
  expr      min       lq     mean   median       uq      max neval cld
 OP1_1 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000   100 a  
 OP2_1 5.178565 4.768749 4.465138 4.441130 4.262399 4.266905   100   c
 OP2_2 3.452386 3.247279 3.063660 3.068571 2.963794 2.948189   100  b 
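
For the example stopwords, the two constructions produce these patterns, which illustrates where the savings come from:

paste0("(^|_)", paste(stopwords, collapse = "(_|$)|(^|_)"), "(_|$)")
## [1] "(^|_)is(_|$)|(^|_)a(_|$)|(^|_)in(_|$)|(^|_)this(_|$)"
paste0("(^|_)(", paste(stopwords, collapse = "|"), ")(_|$)")
## [1] "(^|_)(is|a|in|this)(_|$)"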
Hershelhershell answered 20/10, 2015 at 13:6 Comment(2)
But then I get a positive match for "beautiful" from the stopword "if", etc.Rakes
You are right. Still, there is a minor optimization to your regex: instead of (^|_)is(_|$)|(^|_)a(_|$)|(^|_)in(_|$)|(^|_)this(_|$) you could write it as (^|_)(is|a|in|this)(_|$). I've edited my answer to reflect the differenceHershelhershell
