TermDocumentMatrix sometimes throwing error

3

7

I am creating a word cloud from tweets about various sports teams. The code below runs successfully only about 1 time in 10:

library(twitteR)       # searchTwitter
library(tm)            # Corpus, TermDocumentMatrix, stopwords
library(wordcloud)     # wordcloud
library(RColorBrewer)  # brewer.pal

handle <- 'arsenal'
txt <- searchTwitter(handle, n = 1000, lang = 'en')
t <- sapply(txt, function(x) x$getText())
# Remove everything from the first URL onward, plus retweet markers, then the handle
t <- gsub('http.*\\s*|RT|Retweet', '', t)
t <- gsub(handle, '', t)
t_c <- Corpus(VectorSource(t))
tdm = TermDocumentMatrix(t_c, control = list(removePunctuation = TRUE, stopwords = stopwords("english"), removeNumbers = TRUE, content_transformer(tolower)))
m = as.matrix(tdm)
word_freqs = sort(rowSums(m), decreasing = TRUE)
dm = data.frame(word = names(word_freqs), freq = word_freqs)
wordcloud(dm$word, dm$freq, random.order = FALSE, colors = brewer.pal(8, "Dark2"), rot.per = 0.5)

The other 9 out of 10 times, it throws the following error:

Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms),  : 
  'i, j, v' different lengths
In addition: Warning messages:
1: In mclapply(unname(content(x)), termFreq, control) :
  all scheduled cores encountered errors in user code
2: In simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms),  :
  NAs introduced by coercion

Any ideas guys? I've googled, but so far have come up short! Keep in mind I'm an absolute newbie in R!

Sydney answered 6/9, 2014 at 10:31 Comment(0)
5

After a bit of experimenting, the following line of code completely fixed my issue. It has to run before the corpus is built, so that the text is converted before tm processes it:

t <- iconv(t,to="utf-8-mac")
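
For anyone not on a Mac: as far as I know, the encoding name "utf-8-mac" is only recognised by the iconv that ships with macOS, so the line above may not work elsewhere. Below is a minimal portable sketch of the same idea. It assumes the underlying problem is tweets containing bytes that are not valid UTF-8, which this thread does not confirm. The sample vector tweets and the names valid and clean are made up for illustration; validUTF8() needs R 3.3.0 or later.

```r
# Hypothetical sample input: the second element contains the raw byte 0xE9
# (Latin-1 "e acute"), which is not valid UTF-8 on its own.
tweets <- c("Great win today", "caf\xe9 before the match")

# Flag the elements that are valid UTF-8 (base R >= 3.3.0).
valid <- validUTF8(tweets)

# Re-encode UTF-8 to UTF-8; sub = "" drops any bytes that cannot be converted
# instead of turning the whole element into NA.
clean <- iconv(tweets, from = "UTF-8", to = "UTF-8", sub = "")

# Every element should now be safe to hand to Corpus(VectorSource(clean)).
print(valid)
print(validUTF8(clean))
```

Dropping the offending bytes loses a character here and there, which is usually acceptable for a word cloud. If the text is known to be in another encoding, passing that as the from argument would preserve the characters instead.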
Sydney answered 6/9, 2014 at 10:59 Comment(1)
Can confirm this fixed my problem immediately (running on Mac). - Gonococcus
2

I suspect you ran the following line of code somewhere before calling DocumentTermMatrix:

corpus = tm_map(corpus, PlainTextDocument)

This line converts every document in the corpus to a PlainTextDocument, and the DocumentTermMatrix function does not work properly on the result.

Just repeat the entire process of creating and preprocessing the corpus, skipping the above command, and you will be good to go.
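
As a rough sketch of what preprocessing without that command could look like, the steps can be applied with tm_map, wrapping base functions such as tolower in content_transformer() so the documents stay tm text documents. The sample vector docs and the use of VCorpus here are assumptions for illustration, not something taken from the question.

```r
library(tm)

# Hypothetical sample documents standing in for the cleaned tweet text.
docs <- c("Arsenal win 2-0 at home!", "Great goal, great game")

# Build the corpus from a character vector.
corpus <- VCorpus(VectorSource(docs))

# Preprocess with tm_map; no PlainTextDocument conversion is involved.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Build the matrix and the sorted term frequencies, as in the question.
tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm)
word_freqs <- sort(rowSums(m), decreasing = TRUE)
print(word_freqs)
```

Doing the cleaning on the corpus first means TermDocumentMatrix can be called without a control list, which keeps each preprocessing step visible and easy to test on its own.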

Castellan answered 8/5, 2017 at 13:25 Comment(0)
0

If you remove:

corpus = tm_map(corpus, PlainTextDocument)

you also have to remove:

t_c <- Corpus(VectorSource(t))

Then you'll get the right output for TermDocumentMatrix.

Welby answered 29/1, 2018 at 12:37 Comment(0)
