More efficient means of creating a corpus and DTM with 4M rows

My file has over 4M rows and I need a more efficient way of converting my data to a corpus and document-term matrix so that I can pass it to a Bayesian classifier.

Consider the following code:

library(tm)

GetCorpus <-function(textVector)
{
  doc.corpus <- Corpus(VectorSource(textVector))
  doc.corpus <- tm_map(doc.corpus, tolower)
  doc.corpus <- tm_map(doc.corpus, removeNumbers)
  doc.corpus <- tm_map(doc.corpus, removePunctuation)
  doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))
  doc.corpus <- tm_map(doc.corpus, stemDocument, "english")
  doc.corpus <- tm_map(doc.corpus, stripWhitespace)
  doc.corpus <- tm_map(doc.corpus, PlainTextDocument)
  return(doc.corpus)
}

data <- data.frame(
  c("Let the big dogs hunt","No holds barred","My child is an honor student"), stringsAsFactors = F)

corp <- GetCorpus(data[,1])

inspect(corp)

dtm <- DocumentTermMatrix(corp)

inspect(dtm)

The output:

> inspect(corp)
<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
let big dogs hunt

[[2]]
<<PlainTextDocument (metadata: 7)>>
 holds bar

[[3]]
<<PlainTextDocument (metadata: 7)>>
 child honor stud
> inspect(dtm)
<<DocumentTermMatrix (documents: 3, terms: 9)>>
Non-/sparse entries: 9/18
Sparsity           : 67%
Maximal term length: 5
Weighting          : term frequency (tf)

              Terms
Docs           bar big child dogs holds honor hunt let stud
  character(0)   0   1     0    1     0     0    1   1    0
  character(0)   1   0     0    0     1     0    0   0    0
  character(0)   0   0     1    0     0     1    0   0    1

My question is: what can I use to create a corpus and DTM faster? The approach above becomes extremely slow with more than 300k rows.

I have heard that I could use data.table but I am not sure how.

I have also looked at the qdap package, but it gives me an error when trying to load the package, plus I don't even know if it will work.

Ref. http://cran.r-project.org/web/packages/qdap/qdap.pdf

Whitmer asked 15/8, 2014 at 16:57 Comment(2)
qdap will not be faster for this task as it uses the tm package as a backend. But regex with data.table/dplyr or parallel processing might be.Halmahera
@TylerRinker Thanks so much for the advice. Do you think you could point me in the right direction or (ideally) provide an apples to apples example using the R code I provided above?Whitmer

I think you may want to consider a more regex-focused solution. These are some of the problems I'm wrestling with as a developer myself. I'm currently looking hard at the stringi package for development, as it has consistently named functions that are wicked fast for string manipulation.

In this response I'm attempting to use any tool I know of that is faster than the more convenient methods tm may give us (and certainly much faster than qdap). Here I haven't even explored parallel processing or data.table/dplyr; instead I focus on string manipulation with stringi, keeping the data in a matrix, and manipulating it with packages meant to handle that format. I take your example and multiply it 100,000x. Even with stemming, this takes 17 seconds on my machine.

data <- data.frame(
    text=c("Let the big dogs hunt",
        "No holds barred",
        "My child is an honor student"
    ), stringsAsFactors = F)

## eliminate this step to work as a MWE
data <- data[rep(1:nrow(data), 100000), , drop=FALSE]

library(stringi)
library(SnowballC)
out <- stri_extract_all_words(stri_trans_tolower(SnowballC::wordStem(data[[1]], "english")))
# in older versions of stringi this function was named 'stri_extract_words'
names(out) <- paste0("doc", 1:length(out))

lev <- sort(unique(unlist(out)))
dat <- do.call(cbind, lapply(out, function(x, lev) {
    tabulate(factor(x, levels = lev, ordered = TRUE), nbins = length(lev))
}, lev = lev))
rownames(dat) <- sort(lev)

library(tm)
dat <- dat[!rownames(dat) %in% tm::stopwords("english"), ] 

library(slam)
dat2 <- slam::as.simple_triplet_matrix(dat)

tdm <- tm::as.TermDocumentMatrix(dat2, weighting=weightTf)
tdm

## or...
dtm <- tm::as.DocumentTermMatrix(dat2, weighting=weightTf)
dtm
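
A possible extension, not part of the code above: numbers and punctuation can be stripped with stringi's Unicode-aware regex classes before word extraction, which should also leave UTF-8 text such as Cyrillic intact. A minimal sketch:

## hedged sketch: strip digits (\p{N}) and punctuation (\p{P}), then lowercase and extract words
txt <- stri_replace_all_regex(data[[1]], "[\\p{N}\\p{P}]+", " ")
out <- stri_extract_all_words(stri_trans_tolower(txt))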
Halmahera answered 15/8, 2014 at 20:37 Comment(2)
This is an awesome answer. I am using UTF-8 encoded text (Russian characters), and this supports it whereas the other answer doesn't seem to (on my Windows machine). How could I remove numbers and punctuation with this? I looked at cran.r-project.org/web/packages/stringi/stringi.pdf but I am not sure how to apply these methods in this context. Also, the line dtm <- tm::as.DocumentTermMatrix(dat2, weighting=weightTf) seems to confuse terms and documents whereas the TermDocumentMatrix correctly distinguishes between the two.Whitmer
Based on your code I've prepared a function to calculate TermDocumentMatrix avoiding creation of dense matrix, which as I understand is created by do.call(...) in your example. But It's working extremely slow. Have you got any idea how to speed it up?Manichaeism

Which approach?

data.table is definitely the right way to go. Regex operations are slow, although the ones in stringi are much faster (in addition to being much better).

I went through many iterations of solving this problem when creating quanteda::dfm() for my quanteda package (see the GitHub repo). The fastest solution, by far, involves using the data.table and Matrix packages to index the documents and tokenised features, counting the features within documents, and plugging the result straight into a sparse matrix.

In the code below, I've used example texts that ship with the quanteda package, which you can (and should!) install from CRAN, or get the development version with

devtools::install_github("kbenoit/quanteda")

I'd be very interested to see how it works on your 4M documents. Based on my experience working with corpora of that size, it will work pretty well (if you have enough memory).

Note that in all my profiling, I could not improve the speed of the data.table operations through any sort of parallelisation, because of the way they are written in C++.

Core of the quanteda dfm() function

Here are the bare bones of the data.table-based source code, in case anyone wants to have a go at improving it. It takes as input a list of character vectors representing the tokenized texts. In the quanteda package, the full-featured dfm() works directly on character vectors of documents or on corpus objects, and implements lowercasing, removal of numbers, and removal of spacing by default (but these can all be modified if wished).

require(data.table)
require(Matrix)

dfm_quanteda <- function(x) {
    docIndex <- 1:length(x)
    if (is.null(names(x))) 
        names(docIndex) <- factor(paste("text", 1:length(x), sep="")) else
            names(docIndex) <- names(x)

    alltokens <- data.table(docIndex = rep(docIndex, sapply(x, length)),
                            features = unlist(x, use.names = FALSE))
    alltokens <- alltokens[features != ""]  # if there are any "blank" features
    alltokens[, "n":=1L]
    alltokens <- alltokens[, by=list(docIndex,features), sum(n)]

    uniqueFeatures <- unique(alltokens$features)
    uniqueFeatures <- sort(uniqueFeatures)

    featureTable <- data.table(featureIndex = 1:length(uniqueFeatures),
                               features = uniqueFeatures)
    setkey(alltokens, features)
    setkey(featureTable, features)

    alltokens <- alltokens[featureTable, allow.cartesian = TRUE]
    alltokens[is.na(docIndex), c("docIndex", "V1") := list(1, 0)]

    sparseMatrix(i = alltokens$docIndex, 
                 j = alltokens$featureIndex, 
                 x = alltokens$V1, 
                 dimnames=list(docs=names(docIndex), features=uniqueFeatures))
}

require(quanteda)
str(inaugTexts)
## Named chr [1:57] "Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong the vicissitudes incident to life no event could ha"| __truncated__ ...
## - attr(*, "names")= chr [1:57] "1789-Washington" "1793-Washington" "1797-Adams" "1801-Jefferson" ...
tokenizedTexts <- tokenize(toLower(inaugTexts), removePunct = TRUE, removeNumbers = TRUE)
system.time(dfm_quanteda(tokenizedTexts))
##  user  system elapsed 
## 0.060   0.005   0.064 

That's just a snippet of course but the full source code is easily found on the GitHub repo (dfm-main.R).

quanteda on your example

How's this for simplicity?

require(quanteda)
mytext <- c("Let the big dogs hunt",
            "No holds barred",
            "My child is an honor student")
dfm(mytext, ignoredFeatures = stopwords("english"), stem = TRUE)
# Creating a dfm from a character vector ...
# ... lowercasing
# ... tokenizing
# ... indexing 3 documents
# ... shaping tokens into data.table, found 14 total tokens
# ... stemming the tokens (english)
# ... ignoring 174 feature types, discarding 5 total features (35.7%)
# ... summing tokens by document
# ... indexing 9 feature types
# ... building sparse matrix
# ... created a 3 x 9 sparse dfm
# ... complete. Elapsed time: 0.023 seconds.

# Document-feature matrix of: 3 documents, 9 features.
# 3 x 9 sparse Matrix of class "dfmSparse"
# features
# docs    bar big child dog hold honor hunt let student
# text1   0   1     0   1    0     0    1   1       0
# text2   1   0     0   0    1     0    0   0       0
# text3   0   0     1   0    0     1    0   0       1
Angelika answered 9/7, 2015 at 5:42 Comment(6)
@Whitmer Thanks! dfm() works great on Cyrillic characters too. Our solution to the issues with TermDocument versus DocumentTerm was simple: documents are always and only rows. This is the same as with any data analytic structure, where rows index cases or units and columns indicate variables or features about the units. Terms or their variants are just a type of feature.Angelika
That's a nice speed up. I'd encourage the OP to move the check to this solution if all else is equal.Halmahera
This is fantastic! Is there a way of using dfm without modifications for bigrams (or n-grams), i.e. not single words but the two word combos "Let the", "the big", "big dogs", "dogs hunt" in your mytext[1]?Geoid
Thanks! Yes dfm() can take an ngrams argument, e.g. dfm(mytext, ngrams = 2, concatenator = " ") to produce the results you want.Angelika
Note to this thread: My original solution is very fast, but I have since changed the dfm() code to an even faster method using match() and exploiting the method for constructing a sparse Matrix. See #31570937 for where I discovered this approach.Angelika
How can you do matrix multiplication with the sparse matrices provided by the quanteda package? Moved the question to this thread here.Nikolaus

You have a few choices. @TylerRinker commented about qdap, which is certainly a way to go.

Alternatively (or additionally), you could also benefit from a healthy dose of parallelism. There's a nice CRAN page detailing HPC resources in R. It's a bit dated, though, and the multicore package's functionality is now contained within parallel.

You can scale up your text mining using the multicore apply functions of the parallel package or with cluster computing (also supported by that package, as well as by snowfall and biopara).
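
For illustration, here is a minimal sketch of that idea (mine, not taken from those resources): preprocess the text vector in chunks with parallel::mclapply() and build the DTM once at the end. The chunk size and core count are only illustrative, and mclapply() forking is not available on Windows (use mc.cores = 1 or parLapply() there).

library(tm)
library(parallel)

## hedged sketch: clean one chunk of raw text with the usual tm transformations
clean_chunk <- function(txt) {
  corp <- VCorpus(VectorSource(txt))
  corp <- tm_map(corp, content_transformer(tolower))
  corp <- tm_map(corp, removeNumbers)
  corp <- tm_map(corp, removePunctuation)
  corp <- tm_map(corp, removeWords, stopwords("english"))
  tm_map(corp, stripWhitespace)
}

texts  <- rep(c("Let the big dogs hunt", "No holds barred"), 1000)
chunks <- split(texts, ceiling(seq_along(texts) / 500))   # ~500 docs per chunk (illustrative)
corps  <- mclapply(chunks, clean_chunk, mc.cores = 2)     # forks on Unix-alikes
dtm    <- DocumentTermMatrix(do.call(c, unname(corps)))   # combine corpora, build the DTM once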

Another way to go is to employ a MapReduce approach. A nice presentation on combining tm and MapReduce for big data is available here. While that presentation is a few years old, all of the information is still current, valid and relevant. The same authors have a newer academic article on the topic, which focuses on the tm.plugin.dc plugin. To get around having a VectorSource instead of a DirSource, you can use coercion:

data("crude")
as.DistributedCorpus(crude)

If none of those solutions fit your taste, or if you're just feeling adventurous, you might also see how well your GPU can tackle the problem. There's a lot of variation in how well GPUs perform relative to CPUs and this may be a use case. If you'd like to give it a try, you can use gputools or the other GPU packages mentioned on the CRAN HPC Task View.

Example:

library(tm)
install.packages("tm.plugin.dc")
library(tm.plugin.dc)

GetDCorpus <-function(textVector)
{
  doc.corpus <- as.DistributedCorpus(VCorpus(VectorSource(textVector)))
  doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))
  doc.corpus <- tm_map(doc.corpus, content_transformer(removeNumbers))
  doc.corpus <- tm_map(doc.corpus, content_transformer(removePunctuation))
  # doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english")) # won't accept this for some reason...
  return(doc.corpus)
}

data <- data.frame(
  c("Let the big dogs hunt","No holds barred","My child is an honor student"), stringsAsFactors = F)

dcorp <- GetDCorpus(data[,1])

tdm <- TermDocumentMatrix(dcorp)

inspect(tdm)

Output:

> inspect(tdm)
<<TermDocumentMatrix (terms: 10, documents: 3)>>
Non-/sparse entries: 10/20
Sparsity           : 67%
Maximal term length: 7
Weighting          : term frequency (tf)

         Docs
Terms     1 2 3
  barred  0 1 0
  big     1 0 0
  child   0 0 1
  dogs    1 0 0
  holds   0 1 0
  honor   0 0 1
  hunt    1 0 0
  let     1 0 0
  student 0 0 1
  the     1 0 0
Visa answered 15/8, 2014 at 17:40 Comment(9)
Thanks for the resources, but I really can't find any examples how to apply either the hive package for Hadoop or the tm.plugic.dc for Distributed Corpus. These packages seem to use a DirSource whereas I only have a vector source. Are there any good code samples out there?Whitmer
I know what you mean. When I followed the examples to do this with my own work I also had to adjust my code for that difference. It can definitely be done though. I'll see if I can find an nice example where it's done that way.Visa
@Whitmer It's not exactly a long or pretty example, but does the coercion to of data("crude"); dcrude <- as.DistributedCorpus(crude) suffice such that you can then use the rest of the main examples from the Vingettes or other resources?Visa
@Whitmer you've got a typo in the word plugin. If the typo is only in your comment, not actually in the code you ran, then that error is typically resolved for most any package by downloading the package and installing it from source.Visa
Thanks, sorry for that. I get the error > as.DistributedCorpus(data[,1]) Error in UseMethod("as.DCorpus") : no applicable method for 'as.DCorpus' applied to an object of class "character" Do I have to convert it to another type or something?Whitmer
Try making it into a VCorpus first, then use as.DCorpusVisa
Wow, thanks man! This is awesome. I updated your answer with my code. It seems not to accept my stopwords argument (gives an error) but it appears as though the TDM is removing the stopwords automatically... any ideas?Whitmer
@Whitmer You're welcome! Glad it worked. I'm not sure offhand, I'd need to see the error/results. StackOverflow is giving me a warning on the number of comments so maybe you can come to #R on Freenode or you could make another StackOverflow post and we can dig into it that way.Visa
They should turn the comments into a mini-chat window, and let the +1 comments persist or something... Anyway, I updated your answer with repro code and output. I will open a new question if I can't figure it out. Thanks, again, buddy.Whitmer

This is better than my earlier answer.

The quanteda package has evolved significantly and is now faster and much simpler to use given its built-in tools for this sort of problem -- which is exactly what we designed it for. Part of the OP asked how to prepare the texts for a Bayesian classifier. I've added an example for this too, since quanteda's textmodel_nb() would crunch through 300k documents without breaking a sweat, plus it correctly implements the multinomial NB model (which is the most appropriate for text count matrices -- see also https://mcmap.net/q/903922/-naive-bayes-in-quanteda-vs-caret-wildly-different-results).

Here I demonstrate on the built-in inaugural corpus object, but the functions below would also work with a plain character vector input. I've used this same workflow to process and fit models to 10s of millions of Tweets in minutes, on a laptop, so it's fast.

library("quanteda", warn.conflicts = FALSE)
## Package version: 1.4.1
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

# use a built-in data object
data <- data_corpus_inaugural
data
## Corpus consisting of 58 documents and 3 docvars.

# here we input a corpus, but plain text input works fine too
dtm <- dfm(data, tolower = TRUE, remove_numbers = TRUE, remove_punct = TRUE) %>%
  dfm_wordstem(language = "english") %>%
  dfm_remove(stopwords("english"))

dtm
## Document-feature matrix of: 58 documents, 5,346 features (89.0% sparse).    
tail(dtm, nf = 5)
## Document-feature matrix of: 6 documents, 5 features (83.3% sparse).
## 6 x 5 sparse Matrix of class "dfm"
##               features
## docs           bleed urban sprawl windswept nebraska
##   1997-Clinton     0     0      0         0        0
##   2001-Bush        0     0      0         0        0
##   2005-Bush        0     0      0         0        0
##   2009-Obama       0     0      0         0        0
##   2013-Obama       0     0      0         0        0
##   2017-Trump       1     1      1         1        1

This is a rather trivial example, but for illustration, let's fit a Naive Bayes model, holding out the Trump document. This was the last inaugural speech at the time of this posting ("2017-Trump"), equal in position to the ndoc()th document.

# fit a Bayesian classifier
postwar <- ifelse(docvars(data, "Year") > 1945, "post-war", "pre-war")
textmod <- textmodel_nb(dtm[-ndoc(dtm), ], y = postwar[-ndoc(dtm)], prior = "docfreq")

The same sorts of commands that work with other fitted model objects (e.g. lm(), glm(), etc.) will work with a fitted Naive Bayes textmodel object. So:

summary(textmod)
## 
## Call:
## textmodel_nb.dfm(x = dtm[-ndoc(dtm), ], y = postwar[-ndoc(dtm)], 
##     prior = "docfreq")
## 
## Class Priors:
## (showing first 2 elements)
## post-war  pre-war 
##   0.2982   0.7018 
## 
## Estimated Feature Scores:
##          fellow-citizen  senat   hous  repres among vicissitud   incid
## post-war        0.02495 0.4701 0.2965 0.06968 0.213     0.1276 0.08514
## pre-war         0.97505 0.5299 0.7035 0.93032 0.787     0.8724 0.91486
##            life  event   fill greater anxieti  notif transmit  order
## post-war 0.3941 0.1587 0.3945  0.3625  0.1201 0.3385   0.1021 0.1864
## pre-war  0.6059 0.8413 0.6055  0.6375  0.8799 0.6615   0.8979 0.8136
##          receiv   14th    day present  month    one  hand summon countri
## post-war 0.1317 0.3385 0.5107 0.06946 0.4603 0.3242 0.307 0.6524  0.1891
## pre-war  0.8683 0.6615 0.4893 0.93054 0.5397 0.6758 0.693 0.3476  0.8109
##           whose  voic    can  never   hear  vener
## post-war 0.2097 0.482 0.3464 0.2767 0.6418 0.1021
## pre-war  0.7903 0.518 0.6536 0.7233 0.3582 0.8979

predict(textmod, newdata = dtm[ndoc(dtm), ])
## 2017-Trump 
##   post-war 
## Levels: post-war pre-war

predict(textmod, newdata = dtm[ndoc(dtm), ], type = "probability")
##            post-war       pre-war
## 2017-Trump        1 1.828083e-157
Angelika answered 27/2, 2019 at 10:03 Comment(0)
