Naive Bayes in Quanteda vs caret: wildly different results

Asked 29/1, 2019 at 17:57 Answered 14/4, 2019 at 15:39

Solved r r-caret text-classification supervised-learning quanteda

I'm trying to use the packages quanteda and caret together to classify text based on a trained sample. As a test run, I wanted to compare the built-in naive Bayes classifier of quanteda with the ones in caret. However, I can't seem to get caret to work right.

Here is some code for reproduction. First on the quanteda side:

library(quanteda)
library(quanteda.corpora)
library(caret)
corp <- data_corpus_movies
set.seed(300)
id_train <- sample(docnames(corp), size = 1500, replace = FALSE)

# get training set
training_dfm <- corpus_subset(corp, docnames(corp) %in% id_train) %>%
  dfm(stem = TRUE)

# get test set (documents not in id_train, make features equal)
test_dfm <- corpus_subset(corp, !docnames(corp) %in% id_train) %>%
  dfm(stem = TRUE) %>% 
  dfm_select(pattern = training_dfm, 
             selection = "keep")

# train model on sentiment
nb_quanteda <- textmodel_nb(training_dfm, docvars(training_dfm, "Sentiment"))

# predict and evaluate
actual_class <- docvars(test_dfm, "Sentiment")
predicted_class <- predict(nb_quanteda, newdata = test_dfm)
class_table_quanteda <- table(actual_class, predicted_class)
class_table_quanteda
#>             predicted_class
#> actual_class neg pos
#>          neg 202  47
#>          pos  49 202

Not bad. The accuracy is 80.8% without tuning. Now the same (as far as I know) in caret:

training_m <- convert(training_dfm, to = "matrix")
test_m <- convert(test_dfm, to = "matrix")
nb_caret <- train(x = training_m,
                  y = as.factor(docvars(training_dfm, "Sentiment")),
                  method = "naive_bayes",
                  trControl = trainControl(method = "none"),
                  tuneGrid = data.frame(laplace = 1,
                                        usekernel = FALSE,
                                        adjust = FALSE),
                  verbose = TRUE)

predicted_class_caret <- predict(nb_caret, newdata = test_m)
class_table_caret <- table(actual_class, predicted_class_caret)
class_table_caret
#>             predicted_class_caret
#> actual_class neg pos
#>          neg 246   3
#>          pos 249   2

Not only is the accuracy abysmal here (49.6%, roughly chance), the pos class is hardly ever predicted at all! So I'm pretty sure I'm missing something crucial here, as I would assume the implementations should be fairly similar, but I'm not sure what.
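For reference, the two accuracy figures can be recomputed from the confusion tables printed above. This is a minimal base R sketch; the helpers `confusion()`, `accuracy()` and `recall()` are hypothetical names introduced only for this illustration and are not part of quanteda or caret.

```r
# Rebuild the two confusion tables from the printed output above.
# confusion() is a hypothetical helper, not a package function.
confusion <- function(neg_row, pos_row) {
  matrix(c(neg_row, pos_row), nrow = 2, byrow = TRUE,
         dimnames = list(actual = c("neg", "pos"),
                         predicted = c("neg", "pos")))
}
tab_quanteda <- confusion(c(202, 47), c(49, 202))
tab_caret    <- confusion(c(246, 3), c(249, 2))

# Accuracy: correctly classified documents (the diagonal) over all documents.
accuracy <- function(tab) sum(diag(tab)) / sum(tab)

# Per-class recall: correct predictions over the actual class size.
recall <- function(tab) diag(tab) / rowSums(tab)

accuracy(tab_quanteda)  # 404 / 500 = 0.808
accuracy(tab_caret)     # 248 / 500 = 0.496
recall(tab_caret)       # pos recall is 2 / 251, so pos is almost never found
```

The per-class recall makes the problem more visible than accuracy alone: the caret model puts nearly every document in neg.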

I already looked at the source code for the quanteda function (hoping that it might be built on caret or the underlying package anyway) and saw that there is some weighting and smoothing going on. If I apply the same to my dfm before training (setting laplace = 0 later on), accuracy is a bit better, but still only 53%.

Invalidism answered 29/1, 2019 at 17:57 Comment(0)

The answer is that caret (which uses naive_bayes from the naivebayes package) assumes a Gaussian distribution for numeric features such as these term counts, whereas quanteda::textmodel_nb() is based on a more text-appropriate multinomial distribution (with the option of a Bernoulli distribution as well).
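To see why a Gaussian likelihood is an awkward fit for sparse term counts, here is a small base R sketch on made-up data. It is only an illustration of the modelling mismatch, not a diagnosis of the exact confusion table in the question: the packages apply their own safeguards for near-zero variances, so their internal numbers will differ.

```r
# Hypothetical counts of one rare term in five training documents per class.
neg_counts <- c(0, 0, 0, 0, 0)   # the term never occurs in "neg" documents
pos_counts <- c(0, 0, 3, 0, 1)   # the term occurs in a few "pos" documents

# A Gaussian model summarises each feature by a per-class mean and sd.
gauss_neg <- c(mean = mean(neg_counts), sd = sd(neg_counts))   # sd is 0
gauss_pos <- c(mean = mean(pos_counts), sd = sd(pos_counts))

# Density of seeing the term 0 times or 1 time in a new document.
dens_neg <- dnorm(c(0, 1), mean = gauss_neg["mean"], sd = gauss_neg["sd"])
dens_pos <- dnorm(c(0, 1), mean = gauss_pos["mean"], sd = gauss_pos["sd"])

dens_neg   # degenerate: infinite at 0, exactly zero at 1
dens_pos   # finite, but a bell curve over what are really small integers
```

With tens of thousands of mostly-zero features, many per-class variances are zero or tiny, so the result depends heavily on how an implementation patches these degenerate cases. A multinomial model avoids the issue by modelling counts directly.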

The documentation for textmodel_nb() replicates the example from the IIR book (Manning, Raghavan, and Schütze 2008) and a further example from Jurafsky and Martin (2018) is also referenced. See:

Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. An Introduction to Information Retrieval. Cambridge University Press (Chapter 13). https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
Jurafsky, Daniel, and James H. Martin. 2018. Speech and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Draft of 3rd edition, September 23, 2018 (Chapter 4). https://web.stanford.edu/~jurafsky/slp3/4.pdf
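The worked example from Chapter 13 of the IIR book can be reproduced in a few lines of base R, which shows what a multinomial naive Bayes with add-one smoothing computes. This is a minimal sketch for illustration, not the quanteda implementation; the class labels "Y" and "N" and all object names are choices made here.

```r
# Document-feature counts for the four IIR training documents.
vocab <- c("Chinese", "Beijing", "Shanghai", "Macao", "Tokyo", "Japan")
train <- matrix(
  c(2, 1, 0, 0, 0, 0,    # d1: Chinese Beijing Chinese
    2, 0, 1, 0, 0, 0,    # d2: Chinese Chinese Shanghai
    1, 0, 0, 1, 0, 0,    # d3: Chinese Macao
    1, 0, 0, 0, 1, 1),   # d4: Tokyo Japan Chinese
  nrow = 4, byrow = TRUE, dimnames = list(paste0("d", 1:4), vocab)
)
labels <- factor(c("Y", "Y", "Y", "N"))   # "Y" = document is about China

# d5: Chinese Chinese Chinese Tokyo Japan
test <- c(Chinese = 3, Beijing = 0, Shanghai = 0, Macao = 0, Tokyo = 1, Japan = 1)

# Class priors: share of training documents in each class.
prior <- setNames(as.vector(table(labels)), levels(labels)) / length(labels)

# Term counts per class, then add-one (Laplace) smoothed term probabilities.
term_counts <- rowsum(train, group = labels)             # classes x features
term_prob <- (term_counts + 1) / (rowSums(term_counts) + ncol(train))

# Log score per class: log prior + sum over terms of count * log P(term | class).
# The multinomial coefficient is the same for every class and is omitted.
log_score <- log(prior[rownames(term_prob)]) +
  as.vector(log(term_prob) %*% test[colnames(train)])

exp(log_score)               # about 0.0001 for "N" and 0.0003 for "Y"
names(which.max(log_score))  # "Y"
```

The scores match the book: 3/4 × (3/7)^3 × 1/14 × 1/14 for the China class versus 1/4 × (2/9)^3 × 2/9 × 2/9 for the other class.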

Another package, e1071, produces the same results you found as it is also based on a Gaussian distribution.

library("e1071")
nb_e1071 <- naiveBayes(x = training_m,
                       y = as.factor(docvars(training_dfm, "Sentiment")))
nb_e1071_pred <- predict(nb_e1071, newdata = test_m)
table(actual_class, nb_e1071_pred)
##             nb_e1071_pred
## actual_class neg pos
##          neg 246   3
##          pos 249   2

However, both caret and e1071 work on dense matrices, which is one reason they are so mind-numbingly slow compared to the quanteda approach, which operates on the sparse dfm. So from the standpoint of appropriateness, efficiency, and (as per your results) the performance of the classifier, it should be pretty clear which one is preferred!

library("rbenchmark")
benchmark(
    quanteda = { 
        nb_quanteda <- textmodel_nb(training_dfm, docvars(training_dfm, "Sentiment"))
        predicted_class <- predict(nb_quanteda, newdata = test_dfm)
    },
    caret = {
        nb_caret <- train(x = training_m,
                          y = as.factor(docvars(training_dfm, "Sentiment")),
                          method = "naive_bayes",
                          trControl = trainControl(method = "none"),
                          tuneGrid = data.frame(laplace = 1,
                                                usekernel = FALSE,
                                                adjust = FALSE),
                          verbose = FALSE)
        predicted_class_caret <- predict(nb_caret, newdata = test_m)
    },
    e1071 = {
        nb_e1071 <- naiveBayes(x = training_m,
                       y = as.factor(docvars(training_dfm, "Sentiment")))
        nb_e1071_pred <- predict(nb_e1071, newdata = test_m)
    },
    replications = 1
)
##       test replications elapsed relative user.self sys.self user.child sys.child
## 2    caret            1  29.042  123.583    25.896    3.095          0         0
## 3    e1071            1 217.177  924.157   215.587    1.169          0         0
## 1 quanteda            1   0.235    1.000     0.213    0.023          0         0
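Part of the gap in the timings above comes from storage alone. The sketch below uses the Matrix package (which the sparse dfm class builds on) to compare a sparse count matrix with its dense copy. The dimensions and the 2% density are made-up values for illustration, not measurements of the movie review dfm.

```r
library(Matrix)

set.seed(1)
# Hypothetical 1,000 documents x 2,000 features with 2% non-zero cells.
sparse_counts <- rsparsematrix(
  nrow = 1000, ncol = 2000, density = 0.02,
  rand.x = function(n) rpois(n, lambda = 2) + 1   # positive count values
)

# The dense copy stores every zero explicitly.
dense_counts <- as.matrix(sparse_counts)

# Share of cells that are zero.
sparsity <- 1 - nnzero(sparse_counts) / length(sparse_counts)
sparsity

# Memory footprint of each representation.
format(object.size(sparse_counts), units = "MB")
format(object.size(dense_counts), units = "MB")
```

Any routine that loops over every cell of the dense matrix also does work proportional to the full size, whereas sparse routines can skip the zeros.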

Vesiculate answered 29/1, 2019 at 23:15 Comment(2)

Thanks for this amazing answer! That all makes a lot of sense. Thanks also for the literature! I get now why you bothered to implement nb into quanteda. I also checked which algorithms/packages in caret only operate on dense matrices (I assumed they all handle sparse ones if provided) and was surprised how many still do that. – Invalidism 30/1, 2019 at 13:15

By default, naive_bayes assumes a Gaussian distribution for each continuous (numeric) feature. For discrete features (character/factor/logical), a categorical distribution is used automatically. Kernel density estimation can also be applied to continuous predictors. It will soon also handle count data with a Poisson distribution. The function is very general and thus slow... but reasonably slow. The matrix is more than 90% sparse. naive_bayes is for a different kind of problem. – Nona 22/4, 2019 at 19:26

The above answer is correct. I just wanted to add that you can use a Bernoulli distribution with both the 'naivebayes' and 'e1071' packages by turning your variables into factors. The output of these should match the 'quanteda' textmodel_nb with a Bernoulli distribution.
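A rough sketch of the factor conversion, using a toy count matrix in place of the real training matrix. The helper `to_presence()` is a hypothetical name introduced here, and the e1071 call is guarded so the recoding step runs even if the package is not installed. Whether the result matches quanteda exactly on real data is not verified here.

```r
# Toy count matrix standing in for convert(training_dfm, to = "matrix").
counts <- matrix(c(2, 0, 1,
                   0, 3, 0,
                   1, 0, 0,
                   0, 1, 4),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(NULL, c("good", "bad", "plot")))
sentiment <- factor(c("pos", "neg", "pos", "neg"))

# Recode every count as presence/absence and store it as a two-level factor.
# Fixing the levels keeps both categories even when a term is always absent
# or always present in the data being converted.
to_presence <- function(m) {
  as.data.frame(lapply(as.data.frame(m), function(col) {
    factor(col > 0, levels = c(FALSE, TRUE), labels = c("absent", "present"))
  }))
}
train_bin <- to_presence(counts)
str(train_bin)

# Factor predictors are modelled with categorical (here: Bernoulli) tables.
if (requireNamespace("e1071", quietly = TRUE)) {
  nb_bernoulli <- e1071::naiveBayes(x = train_bin, y = sentiment, laplace = 1)
  print(predict(nb_bernoulli, newdata = train_bin))
}
```

The test set needs the same recoding with the same levels before calling predict(). Note that a data frame of factors is dense, so this route does not fix the speed problem.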

Moreover, you could check out: https://cran.r-project.org/web/packages/fastNaiveBayes/index.html. This implements Bernoulli, multinomial, and Gaussian distributions, works with sparse matrices, and is blazingly fast (currently the fastest on CRAN).

Zeralda answered 14/4, 2019 at 15:39 Comment(0)
