DocumentTermMatrix fails with a strange error only when # terms > 3000

My code below works fine unless I create a DocumentTermMatrix with more than 3000 terms. These lines:

movie_dict <- findFreqTerms(movie_dtm_train, 8)
movie_dtm_hiFq_train <- DocumentTermMatrix(movie_corpus_train, list(dictionary = movie_dict))

Fails with:

Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms),  : 
  'i, j, v' different lengths
In addition: Warning messages:
1: In mclapply(unname(content(x)), termFreq, control) :
  all scheduled cores encountered errors in user code
2: In simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms),  :
  NAs introduced by coercion
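
The first warning shows that the failure happens inside `mclapply`, which reports only that the workers failed and not the underlying error. A minimal sketch for surfacing that error, assuming `DocumentTermMatrix` honours the `mc.cores` option (the `mclapply` call quoted in the warning passes no explicit core count, and `mclapply` falls back to `getOption("mc.cores", 2L)`). `with_single_core` is a hypothetical helper name, not part of tm:

```r
# Hypothetical helper: evaluate an expression with mc.cores temporarily set
# to 1, so that mclapply() runs sequentially and errors are raised directly.
with_single_core <- function(expr) {
  old <- options(mc.cores = 1)   # remember the previous setting
  on.exit(options(old))          # restore it when the function exits
  expr                           # lazily evaluated here, after the option is set
}

# Usage (assumes tm is loaded and the objects from the question exist):
# movie_dtm_hiFq_train <- with_single_core(
#   DocumentTermMatrix(movie_corpus_train, list(dictionary = movie_dict))
# )
```

This may not make the call succeed, but the error it then reports should point at the actual cause rather than at `simple_triplet_matrix`.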

Is there some way I can handle this? Is a 3000 x 60000 matrix just too big for DocumentTermMatrix? That seems pretty small for document classification, though.

Full code snippet:

library(tm)

n1 <- 60000  # number of training documents
n2 <- 70000  # total number of documents loaded
#******* loading the data ******************************************
#kaggle sentiment_analysis dataset    
movie_all <- read.delim('train.tsv', stringsAsFactors=FALSE)
movie_raw <- movie_all[1:(n2),]

#******* cleaning the corpus ***************************************
movie_corpus <- Corpus(VectorSource(movie_raw$Phrase))
movie_corpus_clean <- tm_map(movie_corpus, content_transformer(tolower))
movie_corpus_clean <- tm_map(movie_corpus_clean, removeNumbers)
movie_corpus_clean <- tm_map(movie_corpus_clean, removeWords, stopwords())
movie_corpus_clean <- tm_map(movie_corpus_clean, removePunctuation)
movie_corpus_clean <- tm_map(movie_corpus_clean, stripWhitespace)
movie_dtm <- DocumentTermMatrix(movie_corpus_clean)

#*********** break out data into train/test sets *******************
movie_train <- movie_raw[1:(n1),]
movie_corpus_train <- movie_corpus_clean[1:(n1)]
movie_dtm_train <- movie_dtm[1:(n1),]

#*********** remove rare words from document term matrix ***********
movie_dict <- findFreqTerms(movie_dtm_train, 8)
movie_dtm_hiFq_train <- DocumentTermMatrix(movie_corpus_train, list(dictionary = movie_dict))

Edit: This fails:

movie_dtm_hiFq_train <- DocumentTermMatrix(movie_corpus_train[1:60000], list(dictionary = movie_dict))

but this works:

d1 <- DocumentTermMatrix(movie_corpus_train[1:30000], list(dictionary = movie_dict))
d2 <- DocumentTermMatrix(movie_corpus_train[30001:60000], list(dictionary = movie_dict))
movie_dtm_hiFq_train <- c(d1, d2)

which leads me to believe this must be a size issue.
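
The two-batch workaround can be generalised to any number of non-overlapping batches. This is a rough sketch rather than a definitive fix: `batch_ranges` and `dtm_in_batches` are hypothetical helper names, the batch size of 10000 is an arbitrary assumption, and it relies on `c()` combining document-term matrices the same way it does for `d1` and `d2` above.

```r
# Split 1..n into consecutive, non-overlapping index vectors of at most `size`.
batch_ranges <- function(n, size) {
  if (n < 1) return(list())
  starts <- seq(1, n, by = size)
  lapply(starts, function(s) s:min(s + size - 1, n))
}

# Build one DocumentTermMatrix per batch and combine the pieces with c().
# Assumes the tm package is installed; every batch uses the same dictionary,
# so all pieces share the same set of terms.
dtm_in_batches <- function(corpus, dictionary, size = 10000) {
  parts <- lapply(batch_ranges(length(corpus), size), function(idx) {
    tm::DocumentTermMatrix(corpus[idx], list(dictionary = dictionary))
  })
  do.call(c, parts)
}

# Usage (assumes the objects from the question exist):
# movie_dtm_hiFq_train <- dtm_in_batches(movie_corpus_train, movie_dict)
```

Computing the index ranges in one place avoids accidentally including the same document in two batches.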

Mada answered 22/6, 2014 at 23:55 Comment(3)
Some people have reported this error comes from document encoding: #18505059 - Duo
Tried your suggestion, which didn't work. See my edit (if I call the function in batches, it works). - Mada
I can confirm calling the function in batches does work. However, I also found that omitting all parameters from the function (e.g. stemWords=TRUE) allowed me to call the function on my entire data set, rather than having to break it into chunks. - Resonance
