Let's do some Text Mining
Here I stand with a term-document matrix (from the tm package):
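dtm <- TermDocumentMatrix(
  myCorpus,
  control = list(
    weighting = weightTfIdf,   # tm reads this option as "weighting"; "weight" is not picked up
    tolower = TRUE,
    removeNumbers = TRUE,
    wordLengths = c(2, Inf),   # named minWordLength in older tm releases
    removePunctuation = TRUE,
    stopwords = stopwords("german")
  )
)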
When I do a
typeof(dtm)
I see that it is a "list", and the structure looks like this:
       Docs
Terms   1 2 ...
  lorem 0 0 ...
  ipsum 0 0 ...
  ...   .......
So I try a
wordMatrix = as.data.frame( t(as.matrix( dtm )) )
That works for 1,000 documents.
But when I try it with 40,000 documents, it no longer works.
I get this error:
Fehler in vector(typeof(x$v), nr * nc) : Vektorgröße kann nicht NA sein
Zusätzlich: Warnmeldung:
In nr * nc : NAs durch Ganzzahlüberlauf erzeugt
In English:
Error in vector(typeof(x$v), nr * nc) : vector size cannot be NA
In addition: Warning message:
In nr * nc : NAs produced by integer overflow
So I looked at as.matrix, and it turns out that the function first converts the object to a vector with as.vector and then to a matrix. The conversion to a vector works, but the conversion from the vector to the matrix doesn't.
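The warning itself can be reproduced in base R without tm. This is only a minimal sketch: the 40,000 documents come from the case above, while the 60,000 distinct terms are a made-up figure for illustration, since the real vocabulary size is not stated here.

# nr and nc mirror the names in the error message; the term count is hypothetical
nr <- 60000L   # assumed number of distinct terms
nc <- 40000L   # number of documents

# Integer multiplication cannot represent the result and returns NA
cells_int <- suppressWarnings(nr * nc)
is.na(cells_int)

# In double precision the cell count is representable and exceeds the integer limit
cells <- as.numeric(nr) * as.numeric(nc)
cells > .Machine$integer.max

# Rough size in GiB of a dense matrix of doubles with that many cells
cells * 8 / 1024^3

Under these assumed dimensions the dense matrix would need far more cells than an R integer can count, which would be consistent with nr * nc turning into NA.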
Do you have any suggestions as to what the problem could be?
Thanks, The Captain
tm::removeSparseTerms function – Yancey
DocumentTermMatrix(..., control(... bounds=list(global = c(N,Inf)))) and set N to e.g. 2,3,4... until the size is small enough. – Hateful
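The two suggestions in the comments above could look roughly like this. It is a sketch on a toy corpus, not the original data: the docs vector, the sparse value of 0.6 and N = 2 are all illustrative assumptions, and the right thresholds for 40,000 documents would have to be found by trial.

library(tm)

# Toy stand-in for myCorpus (hypothetical data)
docs <- c("lorem ipsum dolor", "lorem ipsum", "lorem sit amet", "lorem dolor")
myCorpus <- VCorpus(VectorSource(docs))

tdm <- TermDocumentMatrix(myCorpus)
dim(tdm)

# Suggestion 1: drop the sparsest terms after building the matrix
small <- removeSparseTerms(tdm, sparse = 0.6)
dim(small)

# Suggestion 2: keep only terms that occur in at least N documents
N <- 2
bounded <- TermDocumentMatrix(
  myCorpus,
  control = list(bounds = list(global = c(N, Inf)))
)
dim(bounded)

# Only convert to a dense data frame once the matrix is small enough
wordMatrix <- as.data.frame(t(as.matrix(small)))

Both approaches shrink the number of terms, so the product of rows and columns may drop back below the integer limit before as.matrix is called.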