I am using text2vec in R and having difficulty writing a stemming function that works with the itoken function in the text2vec package. The text2vec documentation suggests this stemming function:
stem_tokenizer1 = function(x) {
  word_tokenizer(x) %>% lapply(SnowballC::wordStem(language = 'en'))
}
However, this function does not work. This is the code I ran (borrowed from previous Stack Overflow answers):
library(text2vec)
library(data.table)
library(SnowballC)
data("movie_review")
train_rows = 1:1000
prepr = tolower
stem_tokenizer1 = function(x) {
  word_tokenizer(x) %>% lapply(SnowballC::wordStem(language = 'en'))
}
tok = stem_tokenizer1
it <- itoken(movie_review$review[train_rows], prepr, tok, ids = movie_review$id[train_rows])
This is the error it produces:
Error in { : argument "words" is missing, with no default
I believe the issue is that wordStem needs a character vector, but word_tokenizer produces a list of character vectors.
mr<-movie_review$review[1]
stem_mr1<-stem_tokenizer1(mr)
Error in SnowballC::wordStem(language = "en") : argument "words" is missing, with no default
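For what it's worth, wordStem itself runs fine once the tokens reach its words argument; the documentation's version fails because lapply(SnowballC::wordStem(language='en')) calls wordStem immediately with no words at all, rather than passing each document's token vector to it. A minimal sketch of the working call pattern (assuming SnowballC is installed; the token list here is made up):

```r
library(SnowballC)

# lapply applies wordStem to each element of the list, so each
# character vector of tokens becomes wordStem's first (words)
# argument, with language passed through as an extra argument
tokens <- list(c("running", "jumps"), c("stemmed", "words"))
stemmed <- lapply(tokens, SnowballC::wordStem, language = "en")
stemmed[[1]]  # "run" "jump"
```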
To fix this issue, I wrote this stemming function:
stem_tokenizer2 = function(x) {
  list(unlist(word_tokenizer(x)) %>% SnowballC::wordStem(language = 'en'))
}
However, this function does not work with the create_vocabulary function.
data("movie_review")
train_rows = 1:1000
prepr = tolower
stem_tokenizer2 = function(x) {
  list(unlist(word_tokenizer(x)) %>% SnowballC::wordStem(language = 'en'))
}
tok = stem_tokenizer2
it <- itoken(movie_review$review[train_rows], prepr, tok, ids = movie_review$id[train_rows])
v <- create_vocabulary(it) %>% prune_vocabulary(term_count_min = 5)
There is no error, but the document count shows a different number of documents than the 1000 in the data, so you cannot create a document-term matrix or run an LDA.
v$document_count
[1] 10
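If I understand the behavior correctly, itoken feeds the tokenizer chunks of several documents at a time rather than one document per call (1000 reviews over a default of 10 chunks would explain the count of 10), so a tokenizer that returns list(unlist(...)) collapses each 100-review chunk into a single "document". A base-R sketch of the collapsing, with strsplit standing in for word_tokenizer so it is self-contained:

```r
# itoken passes a character vector holding several documents per chunk;
# the tokenizer is expected to return one list element per document
docs <- c("the cat runs", "dogs bark loudly")

tokens_per_doc <- strsplit(docs, " ")          # list of 2: one element per document
collapsed      <- list(unlist(tokens_per_doc)) # list of 1: the whole chunk merged

length(tokens_per_doc)  # 2
length(collapsed)       # 1
```

With 1000 reviews split into 10 chunks, the collapsed version yields exactly 10 such merged "documents", matching v$document_count above.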
This code (with a vectorizer built from the vocabulary):
vectorizer <- vocab_vectorizer(v)
dtm_train <- create_dtm(it, vectorizer)
dtm_train
Produces this error:
10 x 3809 sparse Matrix of class "dgCMatrix"
Error in validObject(x) : invalid class “dgCMatrix” object: length(Dimnames[1]) differs from Dim[1] which is 10
My questions are: is there something wrong with the function I wrote, and why does it produce this error with create_vocabulary? I suspect the problem is the format of my function's output, but it looks identical to word_tokenizer's output format, and that works fine with itoken and create_vocabulary:
mr<-movie_review$review[1]
word_mr<-word_tokenizer(mr)
stem_mr<-stem_tokenizer2(mr)
str(word_mr)
str(stem_mr)
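The two outputs do look identical here, but only because mr is a single review; a shape that should satisfy both constraints (wordStem receives a character vector, and itoken receives one list element per document) is to lapply over the tokenizer's output per document. A hedged sketch, again using strsplit as a stand-in tokenizer so it runs without text2vec, and with stem_tokenizer3 as a made-up name:

```r
library(SnowballC)

# stem each document's token vector separately, so the result keeps
# one list element per input document instead of collapsing the batch
stem_tokenizer3 <- function(x) {
  lapply(strsplit(x, " "), SnowballC::wordStem, language = "en")
}

stem_tokenizer3(c("running dogs", "jumps high"))
# [[1]] "run" "dog"    [[2]] "jump" "high"
```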
data("movie_review")
train_rows = 1:1000
prepr = tolower
stem_tokenizer1 = function(x) {
  word_tokenizer(x) %>% lapply(function(x) SnowballC::wordStem(x, language = "en"))
}
tok = stem_tokenizer1
it <- itoken(movie_review$review[train_rows], prepr, tok, ids = movie_review$id[train_rows])
v <- create_vocabulary(it) %>% prune_vocabulary(term_count_min = 5)
v$document_count
– Garlicky