Based on the question More efficient means of creating a corpus and DTM, I've prepared my own method for building a Term-Document Matrix from a large corpus, which (I hope) does not require Terms x Documents memory.
sparseTDM <- function(vc) {
  # Needs tm (as.TermDocumentMatrix, weightTf), slam (simple_triplet_matrix)
  # and dplyr (%>%, group_by, tally, ungroup) to be loaded.

  # document ids and raw text pulled from the corpus
  id = unlist(lapply(vc, function(x) { x$meta$id }))
  content = unlist(lapply(vc, function(x) { x$content }))

  # tokenize on whitespace
  out = strsplit(content, "\\s", perl = TRUE)
  names(out) = id

  lev.terms = sort(unique(unlist(out)))
  lev.docs = id

  # v1: for each document, the position of every token in lev.terms
  v1 = lapply(
    out,
    function(x, lev) {
      sort(as.integer(factor(x, levels = lev, ordered = TRUE)))
    },
    lev = lev.terms
  )

  # v2: the document index repeated once per token
  v2 = lapply(
    seq_along(v1),
    function(i, x, n) {
      rep(i, length(x[[i]]))
    },
    x = v1,
    n = names(v1)
  )

  # count (term, document) pairs
  stm = data.frame(i = unlist(v1), j = unlist(v2)) %>%
    group_by(i, j) %>%
    tally() %>%
    ungroup()

  # sparse triplet representation: one entry per distinct (term, doc) pair
  tmp = simple_triplet_matrix(
    i = stm$i,
    j = stm$j,
    v = stm$n,
    nrow = length(lev.terms),
    ncol = length(lev.docs),
    dimnames = list(Terms = lev.terms, Docs = lev.docs)
  )

  as.TermDocumentMatrix(tmp, weighting = weightTf)
}
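For completeness, a minimal usage sketch (the toy documents below are made up purely for illustration; vc is assumed to be a tm VCorpus, which is what the x$meta$id / x$content accessors expect):

library(tm)
library(slam)
library(dplyr)

# toy corpus, illustrative only
docs = c("the cat sat", "the dog barked", "the cat barked")
vc = VCorpus(VectorSource(docs))
tdm = sparseTDM(vc)
inspect(tdm)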
It slows down at the calculation of v1. It had been running for 30 minutes when I stopped it.
I've prepared a small example:
library(microbenchmark)

b = paste0("string", 1:200000)
a = sample(b, 80)

microbenchmark(
  lapply(
    list(a = a),
    function(x, lev) {
      sort(as.integer(factor(x, levels = lev, ordered = TRUE)))
    },
    lev = b
  )
)
Results are:
Unit: milliseconds
expr min lq mean median uq max neval
... 25.80961 28.79981 31.59974 30.79836 33.02461 98.02512 100
id and content each have 126522 elements and lev.terms has 155591 elements, so it looks like I stopped processing too early. Since I'll ultimately be working on ~6M documents, I need to ask: is there any way to speed up this fragment of code?
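One direction I have not benchmarked yet (just a rough sketch, it may well be wrong) would be to replace the per-document factor() call with a single match() over all tokens, so the ~155k-term table is only scanned once, and then split the indices back per document. Reusing out and lev.terms from the function above:

all_tokens = unlist(out, use.names = FALSE)
idx = match(all_tokens, lev.terms)            # one lookup pass over the whole corpus
doc_id = rep(seq_along(out), lengths(out))    # document index per token
v1 = split(idx, doc_id)                       # same shape as the lapply() result
names(v1) = names(out)
# (the sort() in the original does not seem to be needed for the group_by/tally step)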
Comment: out (the raw tokens) together with lev.terms is a bag-of-words, v1 is a word-vector, and v2 seems to be an unnecessary non-vectorized way of replicating the doc-id. – Benedix
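To make the last point of that comment concrete, a vectorized one-liner (untested sketch, assuming v1 as built in the function above) that yields directly what unlist(v2) produces would be:

j = rep(seq_along(v1), lengths(v1))   # doc index repeated once per token, no lapply()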