Really fast word ngram vectorization in R

Edit: The new package text2vec is excellent, and solves this problem (and many others) really well.

See text2vec on CRAN, text2vec on GitHub, and the vignette that illustrates ngram tokenization.
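
For reference, a minimal sketch of the text2vec route (function and argument names as I recall them from that vignette; check the current docs before relying on them):

library(text2vec)

# toy documents; the same recipe applies to the sents1 vector defined below
docs <- c("a b c b", "b c d")
it <- itoken(docs, tokenizer = word_tokenizer)
vocab <- create_vocabulary(it, ngram = c(1L, 2L))   # unigrams + bigrams
vectorizer <- vocab_vectorizer(vocab)
dtm <- create_dtm(it, vectorizer)                   # sparse documents x ngrams matrix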

I have a pretty large text dataset in R, which I've imported as a character vector:

#Takes about 15 seconds
system.time({
  set.seed(1)
  samplefun <- function(n, x, collapse){
    paste(sample(x, n, replace=TRUE), collapse=collapse)
  }
  words <- sapply(rpois(10000, 3) + 1, samplefun, letters, '')
  sents1 <- sapply(rpois(1000000, 5) + 1, samplefun, words, ' ')
})

I can convert this character data to a bag-of-words representation as follows:

library(stringi)
library(Matrix)
tokens <- stri_split_fixed(sents1, ' ')    # list: one character vector of words per sentence
token_vector <- unlist(tokens)
bagofwords <- unique(token_vector)         # the vocabulary
n.ids <- sapply(tokens, length)            # number of tokens per document
i <- rep(seq_along(n.ids), n.ids)          # row (document) indices
j <- match(token_vector, bagofwords)       # column (word) indices
M <- sparseMatrix(i=i, j=j, x=1L)          # repeated (i,j) pairs are summed into counts
colnames(M) <- bagofwords

So R can vectorize 1,000,000 short sentences into a bag-of-words representation in about 3 seconds (not bad!):

> M[1:3, 1:7]
3 x 7 sparse Matrix of class "dgCMatrix"
      fqt hqhkl sls lzo xrnh zkuqc mqh
 [1,]   1     1   1   1    .     .   .
 [2,]   .     .   .   .    1     1   1
 [3,]   .     .   .   .    .     .   .

I can throw this sparse matrix into glmnet or irlba and do some pretty awesome quantitative analysis of textual data. Hooray!
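
For example, a rough sketch of that kind of downstream analysis (the number of components is arbitrary and the response is made up, purely to show the call shapes; both functions accept dgCMatrix input):

library(irlba)
library(glmnet)

# truncated SVD of the bag-of-words matrix (10 components chosen arbitrarily)
svd_m <- irlba(M, nv = 10)

# lasso against a placeholder response, only to illustrate the call
y <- rnorm(nrow(M))
fit <- glmnet(M, y, family = "gaussian")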

Now I'd like to extend this analysis to a bag-of-ngrams matrix, rather than a bag-of-words matrix. So far, the fastest way I've found to do this is as follows (all of the ngram functions I could find on CRAN choked on this dataset, so I got a little help from SO):

find_ngrams <- function(dat, n, verbose=FALSE){
  library(pbapply)
  stopifnot(is.list(dat))
  stopifnot(is.numeric(n))
  stopifnot(n>0)
  if(n == 1) return(dat)
  pblapply(dat, function(y) {
    if(length(y)<=1) return(y)
    # keep the unigrams, then append every ngram of width 2 through n
    c(y, unlist(lapply(2:n, function(n_i) {
      if(n_i > length(y)) return(NULL)
      # embed() over the reversed tokens gives one window of width n_i per row;
      # pasting its columns yields the ngrams
      do.call(paste, unname(as.data.frame(embed(rev(y), n_i), stringsAsFactors=FALSE)), quote=FALSE)
    })))
  })
}

text_to_ngrams <- function(sents, n=2){
  library(stringi)
  library(Matrix)
  tokens <- stri_split_fixed(sents, ' ')
  tokens <- find_ngrams(tokens, n=n, verbose=TRUE)
  token_vector <- unlist(tokens)
  bagofwords <- unique(token_vector)
  n.ids <- sapply(tokens, length)
  i <- rep(seq_along(n.ids), n.ids)
  j <- match(token_vector, bagofwords)
  M <- sparseMatrix(i=i, j=j, x=1L)
  colnames(M) <- bagofwords
  return(M)
}

test1 <- text_to_ngrams(sents1)

This takes about 150 seconds (not bad for a pure R function), but I'd like to go faster and extend to bigger datasets.

Are there any really fast functions in R for n-gram vectorization of text? Ideally I'm looking for an Rcpp function that takes a character vector as input, and returns a sparse matrix of documents x ngrams as output, but would also be happy to have some guidance writing the Rcpp function myself.

Even a faster version of the find_ngrams function would be helpful, as that's the main bottleneck. R is surprisingly fast at tokenization.
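
For what it's worth, here is a rough sketch of the kind of leaner pure-R path I mean for the n = 2 case (not the Rcpp solution I'm asking about): paste each token vector to its one-step shift instead of looping over widths inside pblapply.

find_bigrams <- function(tokens) {
  lapply(tokens, function(y) {
    if (length(y) < 2) return(y)
    # unigrams plus adjacent-pair bigrams
    c(y, paste(y[-length(y)], y[-1], sep = " "))
  })
}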

Edit 1: Here's another example dataset:

sents2 <- sapply(rpois(100000, 500) + 1, samplefun, words, ' ')

In this case, my functions for creating a bag-of-words matrix take about 30 seconds and my functions for creating a bag-of-ngrams matrix take about 500 seconds. Again, existing n-gram vectorizers in R seem to choke on this dataset (though I'd love to be proven wrong!).

Edit 2: Timings vs tau:

zach_t1 <- system.time(zach_ng1 <- text_to_ngrams(sents1))
tau_t1 <- system.time(tau_ng1 <- tau::textcnt(as.list(sents1), n = 2L, method = "string", recursive = TRUE))
tau_t1 / zach_t1 #1.598655

zach_t2 <- system.time(zach_ng2 <- text_to_ngrams(sents2))
tau_t2 <- system.time(tau_ng2 <- tau::textcnt(as.list(sents2), n = 2L, method = "string", recursive = TRUE))
tau_t2 / zach_t2 #1.9295619
Flemming answered 22/7, 2015 at 17:50 Comment(11)
Hmm, have you considered tau::textcnt(as.list(sents), n = 2L, method = "string", recursive = TRUE) instead of find_ngrams? It takes half the time, but delivers only bigrams (n=2). – Adolescence
I hadn't tried that one, but I will. Bigrams would work if it's faster than my code above for both datasets. – Flemming
@Adolescence On both datasets tau::textcnt is 50% slower on my system. I'll update my question with timings and example code; please try it on your system and compare the results. – Flemming
stringdist::qgrams does really fast character q-grams. The author is currently working on supporting words (ints). – Yttria
@Flemming Strange. Now I got tau_t1 / zach_t1 = 649.48 / 675.82. Not much of a difference anymore. – Adolescence
@JanvanderLaan Thanks! I just talked to the author over on GitHub, and am excited for the new functionality. – Flemming
@Flemming I tried your approach, adding a cleaning phase beforehand, but I can see many bigrams in my trigram model. Why is that? gist.github.com/ambodi/b9d3fd69bc02b078b1ab7d180301dd29 – Chisholm
@Chisholm I updated my question. Try the text2vec package instead of my code: cran.r-project.org/web/packages/text2vec/vignettes/… – Flemming
@Flemming The text2vec package breaks when I try 1- to 3-grams on 0.5 GB of data. Any suggestions? – Chisholm
@Chisholm Try 1-grams first, then 1-2-grams. Can you post a reproducible example? I was able to do 2-grams on a Wikipedia dump with my 16GB-RAM laptop (several GB of text): dsnotes.com/articles/text2vec-0-3 – Flemming
@Flemming My data is around 0.5 GB and I am trying 5-grams using 5 cores, and it crashes RStudio with my 18 GB of RAM: gist.github.com/ambodi/d8fc4fbd071c7235fa858d4146ec96c9 Any help would be appreciated; I am really stuck! – Chisholm

This is a really interesting problem, and one that I have spent a lot of time grappling with in the quanteda package. It involves three aspects that I will comment on, although it's only the third that really addresses your question. But the first two points explain why I have only focused on the ngram creation function, since -- as you point out -- that is where the speed improvement can be made.

  1. Tokenization. Here you are using stringi::stri_split_fixed() on the space character, which is the fastest, but not the best, method for tokenizing. We implemented this almost exactly the same way in quanteda::tokenize(x, what = "fastestword"). It's not the best because stringi can do much smarter implementations of whitespace delimiters. (Even the character class \\s is smarter, but slightly slower -- this is implemented as what = "fasterword".) Your question was not about tokenization though, so this point is just context; a short stringi illustration follows after this list.

  2. Tabulating the document-feature matrix. Here we also use the Matrix package, and index the documents and features (I call them features, not terms), and create a sparse matrix directly as you do in the code above. But your use of match() is a lot faster than the match/merge methods we were using through data.table. I am going to recode the quanteda::dfm() function since your method is more elegant and faster. Really, really glad I saw this!

  3. ngram creation. Here I think I can actually help in terms of performance. We implement this in quanteda through an argument to quanteda::tokenize() called ngrams = c(1), where the value can be any integer set. The match for your unigrams plus bigrams would be ngrams = 1:2, for instance. You can examine the code at https://github.com/kbenoit/quanteda/blob/master/R/tokenize.R, see the internal function ngram(). I've reproduced this below and made a wrapper so that we can directly compare it to your find_ngrams() function.
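
(Back to point 1 for a moment: a tiny illustration, on a toy string, of why the literal-space split is fast but cruder than a whitespace-class split in stringi.)

library(stringi)
stri_split_fixed("a  b\tc", " ")      # list(c("a", "", "b\tc")): empty token, tab left inside a token
stri_split_regex("a  b\tc", "\\s+")   # list(c("a", "b", "c")):   runs of spaces and tabs handled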

Code:

# wrapper
find_ngrams2 <- function(x, ngrams = 1, concatenator = " ") { 
    # if ngrams is the full sequence 1:k, one pass with include.all = TRUE builds every size at once
    if (sum(1:length(ngrams)) == sum(ngrams)) {
        result <- lapply(x, ngram, n = length(ngrams), concatenator = concatenator, include.all = TRUE)
    } else {
        # otherwise build each requested ngram size separately and combine
        result <- lapply(x, function(x) {
            xnew <- c()
            for (n in ngrams) 
                xnew <- c(xnew, ngram(x, n, concatenator = concatenator, include.all = FALSE))
            xnew
        })
    }
    result
}

# does the work
ngram <- function(tokens, n = 2, concatenator = "_", include.all = FALSE) {

    if (length(tokens) < n) 
        return(NULL)

    # start with lower ngrams, or just the specified size if include.all = FALSE
    start <- ifelse(include.all, 
                    1, 
                    ifelse(length(tokens) < n, 1, n))

    # set max size of ngram at max length of tokens
    end <- ifelse(length(tokens) < n, length(tokens), n)

    all_ngrams <- c()
    # outer loop for all ngrams down to 1
    for (width in start:end) {
        new_ngrams <- tokens[1:(length(tokens) - width + 1)]
        # inner loop for ngrams of width > 1
        if (width > 1) {
            for (i in 1:(width - 1)) 
                new_ngrams <- paste(new_ngrams, 
                                    tokens[(i + 1):(length(tokens) - width + 1 + i)], 
                                    sep = concatenator)
        }
        # paste onto previous results and continue
        all_ngrams <- c(all_ngrams, new_ngrams)
    }

    all_ngrams
}

Here is the comparison for a simple text:

txt <- c("The quick brown fox named Seamus jumps over the lazy dog.", 
         "The dog brings a newspaper from a boy named Seamus.")
tokens <- tokenize(toLower(txt), removePunct = TRUE)
tokens
# [[1]]
# [1] "the"    "quick"  "brown"  "fox"    "named"  "seamus" "jumps"  "over"   "the"    "lazy"   "dog"   
# 
# [[2]]
# [1] "the"       "dog"       "brings"    "a"         "newspaper" "from"      "a"         "boy"       "named"     "seamus"   
# 
# attr(,"class")
# [1] "tokenizedTexts" "list"     

microbenchmark::microbenchmark(zach_ng <- find_ngrams(tokens, 2),
                               ken_ng <- find_ngrams2(tokens, 1:2))
# Unit: microseconds
#                                expr     min       lq     mean   median       uq     max neval
#   zach_ng <- find_ngrams(tokens, 2) 288.823 326.0925 433.5831 360.1815 542.9585 897.469   100
# ken_ng <- find_ngrams2(tokens, 1:2)  74.216  87.5150 130.0471 100.4610 146.3005 464.794   100

str(zach_ng)
# List of 2
# $ : chr [1:21] "the" "quick" "brown" "fox" ...
# $ : chr [1:19] "the" "dog" "brings" "a" ...
str(ken_ng)
# List of 2
# $ : chr [1:21] "the" "quick" "brown" "fox" ...
# $ : chr [1:19] "the" "dog" "brings" "a" ...

For your really large, simulated text, here is the comparison:

tokens <- stri_split_fixed(sents1, ' ')
zach_ng1_t1 <- system.time(zach_ng1 <- find_ngrams(tokens, 2))
ken_ng1_t1 <- system.time(ken_ng1 <- find_ngrams2(tokens, 1:2))
zach_ng1_t1
#    user  system elapsed 
# 230.176   5.243 246.389 
ken_ng1_t1
#   user  system elapsed 
# 58.264   1.405  62.889 

That's already an improvement, and I'd be delighted if it could be improved further. I should also be able to implement the faster dfm() method in quanteda so that you can get what you want simply through:

dfm(sents1, ngrams = 1:2, what = "fastestword",
    toLower = FALSE, removePunct = FALSE, removeNumbers = FALSE, removeTwitter = TRUE)

(That already works but is slower than your overall result, because the way you create the final sparse matrix object is faster - but I will change this soon.)

Holub answered 24/7, 2015 at 2:59 Comment(5)
I'm glad we can both help each other out! – Flemming
Me too. The GitHub version of quanteda now incorporates the changes to both tokenize() and dfm() using the methods in this post. It should work very quickly for you now in the way I described at the end of my answer. Will deal with the remainder of your GitHub issues soon. Thanks! – Holub
Compared with Zach's answer, his approach is still much faster than quanteda. How come? I thought that after your changes this would have been solved, @KenBenoit. – Chisholm
@Chisholm quanteda::ngrams() has changed a bit since this post, so I will review soon and get back to you. – Holub
@KenBenoit Thanks. I really want to use quanteda because I like the API, but since my text file is large, I reverted and used Zach's solution for now. – Chisholm

Here is a test using the dev version of tokenizers, which you can get using devtools::install_github("ropensci/tokenizers").

Using the definitions of sents1, sents2, and find_ngrams() above:

library(stringi)
library(magrittr)
library(tokenizers)
library(microbenchmark)
library(pbapply)


set.seed(198)
sents1_sample <- sample(sents1, 1000)
sents2_sample <- sample(sents2, 1000)

test_sents1 <- microbenchmark(
  find_ngrams(stri_split_fixed(sents1_sample, ' '), n = 2), 
  tokenize_ngrams(sents1_sample, n = 2),
  times = 25)
test_sents1

Results:

Unit: milliseconds
                                                      expr       min        lq       mean     median         uq        max neval cld
 find_ngrams(stri_split_fixed(sents1_sample, " "), n = 2) 79.855282 83.292816 102.564965  93.622532 109.398341 226.568870    25   b
                     tokenize_ngrams(sents1_sample, n = 2)  4.048635  5.147252   5.472604   5.479414   5.805586   6.595556    25  a

Testing on sents2:

test_sents2 <- microbenchmark(
  find_ngrams(stri_split_fixed(sents2_sample, ' '), n = 2), 
  tokenize_ngrams(sents2_sample, n = 2),
  times = 25)
test_sents2

Results:

Unit: milliseconds
                                                      expr      min       lq     mean   median       uq      max neval cld
 find_ngrams(stri_split_fixed(sents2_sample, " "), n = 2) 509.4257 521.7575 562.9227 529.4479 554.6749 844.6353    25   b
                     tokenize_ngrams(sents2_sample, n = 2) 288.6050 295.3262 306.6635 306.4858 310.6952 332.5479    25  a

Checking just the straight-up timings:

timing <- system.time({find_ngrams(stri_split_fixed(sents1, ' '), n = 2)})
timing

   user  system elapsed 
 90.499   0.506  91.309 

timing_tokenizers <- system.time({tokenize_ngrams(sents1, n = 2)})
timing_tokenizers

   user  system elapsed 
  6.940   0.022   6.964 

timing <- system.time({find_ngrams(stri_split_fixed(sents2, ' '), n = 2)})
timing

   user  system elapsed 
138.957   3.131 142.581 

timing_tokenizers <- system.time({tokenize_ngrams(sents2, n = 2)})
timing_tokenizers

   user  system elapsed 
  65.22    1.57   66.91

A lot will depend on the texts being tokenized, but that seems to indicate a speedup of 2x to 20x.
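
To get from there to the documents x ngrams sparse matrix the question asks for, the tokenize_ngrams() output (a list with one character vector of ngrams per document; I'm assuming the n and n_min arguments behave as described in the package docs) can be dropped straight into the match()/sparseMatrix() step from the question:

library(tokenizers)
library(Matrix)

tokens <- tokenize_ngrams(sents1, n = 2, n_min = 1)   # unigrams and bigrams per document
token_vector <- unlist(tokens)
vocab <- unique(token_vector)
n.ids <- lengths(tokens)
i <- rep(seq_along(n.ids), n.ids)                     # document (row) index
j <- match(token_vector, vocab)                       # ngram (column) index
M2 <- sparseMatrix(i = i, j = j, x = 1L)
colnames(M2) <- vocab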

Himes answered 14/3, 2018 at 2:39 Comment(0)
