R - slow lapply with sort on an ordered factor

Based on the question More efficient means of creating a corpus and DTM, I've prepared my own method for building a Term Document Matrix from a large corpus, one which (I hope) does not require Terms x Documents memory.

library(dplyr)  # %>%, group_by(), tally(), ungroup()
library(slam)   # simple_triplet_matrix()
library(tm)     # as.TermDocumentMatrix(), weightTf

sparseTDM <- function(vc){
  # document ids and raw text from the tm corpus
  id = unlist(lapply(vc, function(x){x$meta$id}))
  content = unlist(lapply(vc, function(x){x$content}))
  # tokenise on whitespace
  out = strsplit(content, "\\s", perl = TRUE)
  names(out) = id
  lev.terms = sort(unique(unlist(out)))  # global term vocabulary
  lev.docs = id

  # v1: for each document, the sorted integer indices of its tokens
  # within lev.terms
  v1 = lapply(
    out,
    function(x, lev) {
      sort(as.integer(factor(x, levels = lev, ordered = TRUE)))
    },
    lev = lev.terms
  )

  # v2: the document index, repeated once per token of that document
  # (the n argument is unused)
  v2 = lapply(
    seq_along(v1),
    function(i, x, n){
      rep(i, length(x[[i]]))
    },
    x = v1,
    n = names(v1)
  )

  # count occurrences of each (term, document) pair
  stm = data.frame(i = unlist(v1), j = unlist(v2)) %>%
    group_by(i, j) %>%
    tally() %>%
    ungroup()

  # assemble the sparse Terms x Docs triplet matrix
  tmp = simple_triplet_matrix(
    i = stm$i,
    j = stm$j,
    v = stm$n,
    nrow = length(lev.terms),
    ncol = length(lev.docs),
    dimnames = list(Terms = lev.terms, Docs = lev.docs)
  )

  as.TermDocumentMatrix(tmp, weighting = weightTf)
}
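
A minimal usage sketch (the two-document corpus below is hypothetical, for illustration only; the function assumes a tm corpus whose content holds whitespace-separated text):

library(tm)
vc = VCorpus(VectorSource(c("a b b c", "b c d")))
tdm = sparseTDM(vc)   # 4 terms x 2 docs, term-frequency weighted
inspect(tdm)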

It slows down at the calculation of v1. It had been running for 30 minutes when I stopped it.
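
A quick way to confirm where the time goes is R's built-in sampling profiler, run on a slice of the corpus (a sketch; the file name is arbitrary and vc is assumed to be the corpus from above):

Rprof("sparseTDM.prof")
tdm = sparseTDM(vc[1:1000])
Rprof(NULL)
head(summaryRprof("sparseTDM.prof")$by.self)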

I've prepared a small example:

library(microbenchmark)

b = paste0("string", 1:200000)
a = sample(b, 80)
microbenchmark(
  lapply(
    list(a=a),
    function(x, lev) {
      sort(as.integer(factor(x, levels = lev, ordered = TRUE)))
    },
    lev = b
  )
)

Results are:

Unit: milliseconds
expr      min       lq      mean   median       uq      max neval
...  25.80961 28.79981  31.59974 30.79836 33.02461 98.02512   100

id and content have 126522 elements each, and lev.terms has 155591 elements, so it looks like I stopped processing too early. Since ultimately I'll be working on ~6M documents, I need to ask: is there any way to speed up this fragment of code?

Peroxidase answered 5/4, 2015 at 23:37 Comment(3)
You should put library(dplyr); library(whatever_else) at the top so your code is reproducible. I'd also put dplyr as a tag, maybe instead of corpus. – Douai
Help us understand what the code's doing; it's pretty opaque, and a couple of comments would help. The variable names too: I would have called out raw_tokens; lev.terms is a bag-of-words; v1 is a word vector; and v2 seems to be an unnecessary, non-vectorized way of replicating the doc id. – Benedix
So... I wrote this code when I was starting to work with R, so there is probably a lot of non-optimal code. But it worked... – Antlia

For now I've sped it up by replacing

sort(as.integer(factor(x, levels = lev, ordered = TRUE)))

with

ind = which(lev %in% x)   # positions in lev of the levels that actually occur in x
cnt = as.integer(factor(x, levels = lev[ind], ordered = TRUE))  # codes within that subset
sort(ind[cnt])            # map codes back to positions in the full lev, then sort

Now timings are:

Unit: milliseconds
expr      min       lq     mean   median       uq      max neval
...  5.248479 6.202161 6.892609 6.501382 7.313061 10.17205   100
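
As an aside, the integer codes of factor(x, levels = lev) are just the positions of x's elements in lev, so base match() should give the same result without building a factor at all. A minimal sketch to verify, reusing the toy data from the question:

b = paste0("string", 1:200000)
a = sample(b, 80)
ind = which(b %in% a)
cnt = as.integer(factor(a, levels = b[ind], ordered = TRUE))
v_subset = sort(ind[cnt])      # the optimised version above
v_match = sort(match(a, b))    # plain match(), no factor needed
stopifnot(identical(v_subset, v_match))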
Peroxidase answered 7/4, 2015 at 14:11 Comment(3)
Help us understand why that should be ~5x faster? Why must the factor be ordered? – Benedix
It's faster because factor looks for level values only among the values that actually appear in x. The factor is ordered to make sure the integer value assigned to each factor level is the same as its position in the vector given as the levels parameter. – Antlia
I've checked, and it assigns the same values even without ordered = TRUE in R 3.2.3, but it's not guaranteed to always stay that way, as the implementation of factor may change. – Antlia
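
A tiny illustration of that point, using nothing beyond base R: the codes follow the positions in levels, with or without ordered = TRUE.

x = c("b", "a", "c")
lev = c("c", "b", "a")
as.integer(factor(x, levels = lev))                  # 2 3 1
as.integer(factor(x, levels = lev, ordered = TRUE))  # 2 3 1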

I went through many iterations of solving this problem while creating quanteda::dfm() (see the GitHub repo here). By far the fastest solution involves using the data.table and Matrix packages to index the documents and tokenised features, count the features within documents, and plug the result straight into a sparse matrix, like this:

require(data.table)
require(Matrix)

dfm_quanteda <- function(x) {
    docIndex <- 1:length(x)
    if (is.null(names(x)))
        names(docIndex) <- factor(paste("text", 1:length(x), sep = "")) else
            names(docIndex) <- names(x)

    # long table: one row per token, tagged with its document index
    alltokens <- data.table(docIndex = rep(docIndex, sapply(x, length)),
                            features = unlist(x, use.names = FALSE))
    alltokens <- alltokens[features != ""]  # drop any "blank" features
    alltokens[, "n" := 1L]
    # count each feature within each document
    alltokens <- alltokens[, by = list(docIndex, features), sum(n)]

    uniqueFeatures <- sort(unique(alltokens$features))

    # map each feature to a column index via a keyed join
    featureTable <- data.table(featureIndex = 1:length(uniqueFeatures),
                               features = uniqueFeatures)
    setkey(alltokens, features)
    setkey(featureTable, features)

    alltokens <- alltokens[featureTable, allow.cartesian = TRUE]
    alltokens[is.na(docIndex), c("docIndex", "V1") := list(1, 0)]  # guard for unmatched rows

    sparseMatrix(i = alltokens$docIndex,
                 j = alltokens$featureIndex,
                 x = alltokens$V1,
                 dimnames = list(docs = names(docIndex), features = uniqueFeatures))
}

require(quanteda)
str(inaugTexts)
## Named chr [1:57] "Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong the vicissitudes incident to life no event could ha"| __truncated__ ...
## - attr(*, "names")= chr [1:57] "1789-Washington" "1793-Washington" "1797-Adams" "1801-Jefferson" ...
tokenizedTexts <- tokenize(toLower(inaugTexts), removePunct = TRUE, removeNumbers = TRUE)
system.time(dfm_quanteda(tokenizedTexts))
##  user  system elapsed 
## 0.060   0.005   0.064 

That's just a snippet, of course, but the full source code is easily found in the GitHub repo (dfm-main.R).

I also encourage you to try the full dfm() from the package on your texts to see how it performs. You can install it from CRAN, or get the development version with:

devtools::install_github("kbenoit/quanteda")
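
For instance, with the 2015-era API used above (argument names may have changed in later quanteda releases), a quick timing on the same texts might look like:

library(quanteda)
system.time(mydfm <- dfm(toLower(inaugTexts), removePunct = TRUE))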

Accordance answered 9/7, 2015 at 5:27 Comment(0)

Have you tried experimenting with the sort method (algorithm) and specifying quicksort or shell sort?

Something like:

sort(as.integer(factor(x, levels = lev, ordered = TRUE)), method = "shell")

or:

sort(as.integer(factor(x, levels = lev, ordered = TRUE)), method = "quick")

Also, you might try using some intermediate variables to evaluate the nested functions, in case the sort algorithm is re-executing these steps again and again:

foo <- factor(x, levels = lev, ordered = TRUE)
bar <- as.integer(foo)
sort(bar, method = "quick")

or

sort(bar)

Good luck!
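
If you want to measure whether the algorithm matters at all here, a minimal benchmark sketch (in the R versions of that era, method = "quick" was documented as available only for numeric vectors, hence the as.numeric()):

library(microbenchmark)
x <- as.numeric(sample.int(200000, 80))
microbenchmark(
  shell = sort(x, method = "shell"),
  quick = sort(x, method = "quick")
)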

Deppy answered 5/4, 2015 at 23:56 Comment(1)
Even when I remove the sorting completely, the timings are the same. It looks like the way I find the indexes of a's elements in b is what takes so much time. – Antlia
