Naive Bayes vs. SVM for classifying text data

Asked 12/2, 2016 at 10:21 Answered 24/12, 2017 at 20:0

machine-learning scikit-learn theory supervised-learning

I'm working on a problem that involves classifying a large database of texts. The texts are very short (think 3-8 words each) and there are 10-12 categories into which I wish to sort them. For the features, I'm simply using the tf–idf frequency of each word. Thus, the number of features is roughly equal to the number of words that appear overall in the texts (I'm removing stop words and some others).

In trying to come up with a model to use, I've had the following two ideas:

Naive Bayes (likely the sklearn multinomial Naive Bayes implementation)
Support vector machine (with stochastic gradient descent used in training, also an sklearn implementation)

I have built both models, and am currently comparing the results.

What are the theoretical pros and cons of each model? Why might one of these be better for this type of problem? I'm new to machine learning, so what I'd like to understand is why one might do better.

Many thanks!

Sidell answered 12/2, 2016 at 10:21 Comment(1)

You're better off trying both and comparing. No one can answer for your data set. – Headwind 12/2, 2016 at 10:25

The biggest difference between the models you're building, from a "features" point of view, is that Naive Bayes treats the features as conditionally independent given the class, whereas an SVM looks at the interactions between them to a certain degree, as long as you're using a non-linear kernel (Gaussian/RBF, polynomial, etc.). A linear SVM, which is what SGD training in sklearn gives you, does not model those interactions explicitly. So if you have interactions, and given your problem you most likely do, a kernel SVM will be better at capturing them, and hence better at the classification task you want.

The consensus among ML researchers and practitioners is that, in almost all cases, an SVM performs better than Naive Bayes.

From a theoretical point of view, it is a little bit hard to compare the two methods. One is probabilistic in nature, while the other is geometric. However, it's quite easy to come up with a function whose dependencies between variables are not captured by Naive Bayes, e.g. y(a, b) = ab, so we know it isn't a universal approximator. SVMs with the proper choice of kernel are, though (as are neural networks with 2 or 3 layers), so from that point of view the theory matches the practice.
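To make the y(a, b) = ab example concrete, here is a minimal sketch in plain Python. It assumes a and b take values in {-1, +1} (the domain is my assumption; with that choice the function behaves like XOR). The helper names are made up for illustration.

```python
from collections import Counter, defaultdict
from itertools import product

# All four points of y = a * b with a, b in {-1, +1} (assumed domain).
data = [((a, b), a * b) for a, b in product((-1, 1), repeat=2)]

def fit_naive_bayes(samples):
    """Estimate class counts and per-feature value counts by simple counting."""
    class_counts = Counter(y for _, y in samples)
    feature_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    for x, y in samples:
        for i, v in enumerate(x):
            feature_counts[(i, y)][v] += 1
    return class_counts, feature_counts

def posterior(x, class_counts, feature_counts):
    """Return P(y | x) under the naive conditional-independence assumption."""
    total = sum(class_counts.values())
    scores = {}
    for y, n_y in class_counts.items():
        score = n_y / total  # class prior
        for i, v in enumerate(x):
            score *= feature_counts[(i, y)][v] / n_y  # P(x_i = v | y)
        scores[y] = score
    norm = sum(scores.values())
    return {y: s / norm for y, s in scores.items()}

class_counts, feature_counts = fit_naive_bayes(data)
posteriors = {x: posterior(x, class_counts, feature_counts) for x, _ in data}
for x, y in data:
    print(x, "true y =", y, "posterior =", posteriors[x])
```

Each feature on its own is equally likely under both classes, so every posterior comes out at 0.5 and Naive Bayes cannot do better than guessing here. The label depends only on the interaction between a and b, which is exactly what the independence assumption throws away.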

But in the end it comes down to performance on your problem: you basically want to choose the simplest method that gives good enough results and runs fast enough. Spam detection, for example, has famously been solvable with just Naive Bayes. Face recognition in images has been handled by a similarly simple method enhanced with boosting.
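A comparison on your own data could look roughly like the sketch below. The texts and category names are made-up placeholders, and the tiny data set is only there to make the snippet run; the scores it prints say nothing about real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: short texts in three made-up categories.
texts = [
    "cheap flights to paris", "hotel booking in rome",
    "train tickets to berlin", "car rental at airport",
    "chicken soup recipe", "easy pasta dinner",
    "chocolate cake recipe", "vegan dinner ideas",
    "laptop battery replacement", "phone screen repair",
    "wireless keyboard review", "laptop screen review",
]
labels = ["travel"] * 4 + ["food"] * 4 + ["tech"] * 4

models = {
    "multinomial_nb": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    # loss="hinge" makes SGDClassifier a linear SVM trained with SGD.
    "linear_svm_sgd": make_pipeline(
        TfidfVectorizer(), SGDClassifier(loss="hinge", random_state=0)
    ),
}

# Same features, same folds, so the two scores are directly comparable.
scores = {
    name: cross_val_score(model, texts, labels, cv=2).mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: mean accuracy = {score:.2f}")
```

Putting the vectorizer inside the pipeline means tf-idf is refitted on each training fold, so no vocabulary or idf statistics leak in from the held-out texts.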

Fir answered 12/2, 2016 at 10:55 Comment(2)

SVM is not always better. Refer to this paper by Wang and Manning: nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf – Paunch 31/12, 2017 at 3:55

@Horia: How do you think logistic regression compares with Naive Bayes and SVMs? – Felipafelipe 19/4, 2019 at 10:5

Support Vector Machine (SVM) is better at full-length content.
Multinomial Naive Bayes (MNB) is better at snippets.

MNB is stronger for snippets than for longer documents. While (Ng and Jordan, 2002) showed that NB is better than SVM/logistic regression (LR) with few training cases, we show that MNB is also better with short documents. In contrast to their result that an SVM usually beats NB when it has more than 30–50 training cases, we show that MNB is still better on snippets even with relatively large training sets (9k cases).

In short, NBSVM (an SVM trained on Naive Bayes log-count ratio features, as proposed in the paper cited below) seems to be an appropriate and very strong baseline for text classification.

Source Code: https://github.com/prakhar-agarwal/Naive-Bayes-SVM

Reference: http://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf

Cite: Wang, Sida, and Christopher D. Manning. "Baselines and bigrams: Simple, good sentiment and topic classification." Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2. Association for Computational Linguistics, 2012.
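For readers who want the gist of NBSVM without opening the repository, here is a rough sketch of how I read the paper's recipe for a binary task: scale binarized word features by the Naive Bayes log-count ratio, train a linear SVM on the scaled features, then interpolate the SVM weights towards their mean magnitude. The toy snippets and the values of alpha, beta and C are illustrative assumptions, and this is not the code from the linked repository.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy snippets; 1 = positive, 0 = negative.
texts = ["great movie", "great acting", "loved it",
         "awful movie", "awful plot", "hated it"]
y = np.array([1, 1, 1, 0, 0, 0])

# Binarized word indicators (a word is either present or absent).
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(texts).toarray().astype(float)

alpha = 1.0  # smoothing parameter (illustrative value)
p = alpha + X[y == 1].sum(axis=0)  # smoothed word counts, positive class
q = alpha + X[y == 0].sum(axis=0)  # smoothed word counts, negative class
r = np.log((p / p.sum()) / (q / q.sum()))  # log-count ratio per word

# Linear SVM on the features scaled elementwise by r.
svm = LinearSVC(C=1.0).fit(X * r, y)

# Interpolate between the mean weight magnitude and the SVM weights.
beta = 0.25  # interpolation parameter (illustrative value)
w = svm.coef_.ravel()
w_interp = (1 - beta) * np.abs(w).mean() + beta * w

def predict(new_texts):
    """Classify new snippets with the interpolated weights."""
    X_new = vectorizer.transform(new_texts).toarray() * r
    return (X_new @ w_interp + svm.intercept_[0] > 0).astype(int)

ratio = {word: r[idx] for word, idx in vectorizer.vocabulary_.items()}
print("r[great] =", ratio["great"], "r[awful] =", ratio["awful"])
print(predict(["great plot", "awful acting"]))
```

Words seen mostly in positive snippets get a positive ratio and words seen mostly in negative ones get a negative ratio, so the SVM starts from features that already carry the Naive Bayes evidence. With beta close to 0 the model leans on that evidence alone; with beta at 1 it is a plain SVM on the scaled features.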

Paunch answered 24/12, 2017 at 20:0 Comment(3)

Thanks for your answer! My task is to classify medical text documents, typically about one A4 page each. Which classifier is preferable for this purpose, MNB or SVM? – Youthen 29/12, 2017 at 23:6

I would say neither. Use NBSVM to get the best of both approaches. I have added a link to my code repository. – Paunch 30/12, 2017 at 12:35

Thanks! Right now I'm looking for a Java implementation of NBSVM. I use the Datumbox framework for MNB; it also has SVM, but I'm unable to find NBSVM there at the moment. – Youthen 30/12, 2017 at 19:15
