Naive Bayes probability always 1

I started using sklearn.naive_bayes.GaussianNB for text classification, and have been getting fine initial results. I want to use the probability returned by the classifier as a measure of confidence, but the predict_proba() method always returns "1.0" for the chosen class, and "0.0" for all the rest.

I know (from here) that "...the probability outputs from predict_proba are not to be taken too seriously", but to that extent?! The classifier can mistake finance-investing or chords-strings, but the predict_proba() output shows no sign of hesitation...

A little about the context:
- I've been using sklearn.feature_extraction.text.TfidfVectorizer for feature extraction without, for a start, restricting the vocabulary with stop_words or min/max_df, so I have been getting very large vectors.
- I've been training the classifier on a hierarchical category tree (shallow: no more than 3 layers deep) with 7 manually categorized texts per category. For now it is flat training: I am not taking the hierarchy into account.

The resulting GaussianNB object is very big (~300MB), and prediction is rather slow: around 1 second for one text.
Can this be related? Are the huge vectors at the root of all this?
How do I get meaningful predictions? Do I need to use a different classifier?

Here's the code I'm using:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
import numpy as np
import joblib  # formerly sklearn.externals.joblib, which was removed in scikit-learn 0.23

Vectorizer = TfidfVectorizer(input = 'content')
vecs = Vectorizer.fit_transform(TextsList) # ~2000 strings
joblib.dump(Vectorizer, 'Vectorizer.pkl') 
gnb = GaussianNB()
Y = np.array(TargetList) # ~2000 categories 
gnb.fit(vecs.toarray(), Y)
joblib.dump(gnb, 'Classifier.pkl') 
...

#In a different function:
Vectorizer = joblib.load('Vectorizer.pkl')
Classifier = joblib.load('Classifier.pkl')
InputList = [Text] # One string
Vec = Vectorizer.transform(InputList)
Probs = Classifier.predict_proba([Vec.toarray()[0]])[0]
MaxProb = max(Probs)
MaxProbIndex = np.where(Probs==MaxProb)[0][0]
Category = Classifier.classes_[MaxProbIndex]
result = (Category, MaxProb)  

Update:
Following the advice below, I tried MultinomialNB & LogisticRegression. They both return varying probabilities, and are better in every way for my task: much more accurate classification, smaller objects in memory & much better speed (MultinomialNB is lightning fast!).

I now have a new problem: the returned probabilities are very small, typically in the range 0.004-0.012. This is for the predicted/winning category (and the classification is accurate).

Barbaresi answered 5/8, 2013 at 14:5

"...the probability outputs from predict_proba are not to be taken too seriously"

I'm the guy who wrote that. The point is that naive Bayes tends to predict probabilities that are almost always either very close to zero or very close to one, which is exactly the behavior you observe. Logistic regression (sklearn.linear_model.LogisticRegression or sklearn.linear_model.SGDClassifier(loss="log")) produces more realistic probabilities.
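
One way to see why this happens: naive Bayes adds up one log-likelihood term per feature, so with very large vectors even a tiny per-feature preference for one class grows into a huge log-odds gap, and normalizing that gap gives 1.0 and 0.0. The sketch below is only an illustration with made-up numbers (the 0.05 margin and the feature counts are assumptions, not values from the question), and posterior_from_log_joint is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

def posterior_from_log_joint(log_joint):
    """Turn per-class log joint likelihoods into normalized probabilities."""
    shifted = log_joint - np.max(log_joint)  # subtract the max for numerical stability
    probs = np.exp(shifted)
    return probs / probs.sum()

# Assumed for illustration: every feature favours class 0 by 0.05 nats.
per_feature_margin = 0.05

# 10 features: total margin 0.5, so the posterior is still hesitant.
few = posterior_from_log_joint(np.array([10 * per_feature_margin, 0.0]))

# 20000 features: total margin 1000, so the posterior saturates.
many = posterior_from_log_joint(np.array([20000 * per_feature_margin, 0.0]))

print(few)   # class 0 gets roughly 0.62
print(many)  # class 0 gets 1.0, class 1 underflows to 0.0
```

The per-feature evidence is identical in both cases; only the number of features treated as independent differs.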

The resulting GaussianNB object is very big (~300MB), and prediction is rather slow: around 1 second for one text.

That's because GaussianNB is a non-linear model and does not support sparse matrices (which you found out already, since you're using toarray). Use MultinomialNB, BernoulliNB or logistic regression, which are much faster at predict time and also smaller. Their assumptions with respect to the input are also more realistic for term features. GaussianNB is really not a good estimator for text classification.
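
A minimal sketch of that suggestion, assuming a tiny made-up corpus in place of the asker's TextsList and TargetList. Both estimators accept the sparse TF-IDF matrix directly, so no toarray() call is needed. The texts, labels, and query string are placeholders. (In recent scikit-learn releases the SGDClassifier option mentioned above is spelled loss="log_loss".)

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for TextsList / TargetList.
texts = [
    "stocks bonds and dividends",
    "investing in index funds",
    "bank loans and interest rates",
    "guitar chords and scales",
    "violin strings and bows",
    "piano chords for beginners",
]
labels = ["finance", "finance", "finance", "music", "music", "music"]

models = {
    "MultinomialNB": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "LogisticRegression": make_pipeline(
        TfidfVectorizer(), LogisticRegression(max_iter=1000)
    ),
}

results = {}
for name, model in models.items():
    model.fit(texts, labels)  # the vectorizer output stays sparse throughout
    probs = model.predict_proba(["chords on a guitar"])[0]
    best = probs.argmax()
    results[name] = (model.classes_[best], float(probs[best]))

for name, (category, prob) in results.items():
    print(name, category, round(prob, 3))
```

Keeping the vectorizer and classifier in one pipeline also means a single object can be saved with joblib.dump instead of two.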

Anchovy answered 5/8, 2013 at 16:43 Comment(9)
Thanks for the swift & very helpful answer. I followed your advice, and here's a small follow-up: - Barbaresi
... both LogisticRegression and MultinomialNB returned varying probabilities, almost in perfect agreement with each other, though the numbers were really small: typically in the range 0.004-0.012; SGDClassifier can return probabilities only for binary estimates; all three were indeed faster (MultinomialNB - extremely fast), smaller & more accurate than GaussianNB. Two questions: 1. How do I understand these very low probability values? 2. Is there a tool inside scikit-learn for dimensionality reduction, or should I 'play' with min/max_df while monitoring the score()? - Barbaresi
@AviM: if you upgrade to 0.14, then SGDClassifier does multiclass probabilities. Ad 1., if LR gives extreme probabilities, then either your classes are very clear-cut, or you need to regularize more. Ad 2., there are many options for dimensionality reduction. For text, there's TruncatedSVD, but that won't play nicely with MultinomialNB. You can also try feature selection, see the document classification example in the examples directory. - Anchovy
Sorry for maybe not being so clear: the "low value" probabilities that I was quoting are the probabilities for the predicted/winning categories, those which came first; and the classification is rather good now. This is why I had been surprised by the low value. Thanks again. - Barbaresi
I will rephrase it as a question: What do I make of the very low probabilities I've been getting (in both MultinomialNB & LogisticRegression)? Can I take them seriously? Is it that my categories are 'too close'? Note: the classification is rather accurate. - Barbaresi
@AviM: how could they be too close? Extreme probabilities rather indicate that your samples are far away from the decision boundary. - Anchovy
Too close to one another, I mean. What do you make of max(Classifier.predict_proba(Vec)[0]) giving values like 0.001? (Classifier = MultinomialNB() / LogisticRegression()) - Barbaresi
@AviM: you may have an easy problem, or you don't regularize enough. Try tuning alpha (NB) or C (LR). - Anchovy
Thanks, I will try. By the way, I wrongly quoted the number: it is around 0.01, not 0.001. - Barbaresi
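
The last suggestion in the comments (tuning alpha for MultinomialNB, or C for LogisticRegression) could be sketched with GridSearchCV as below. The corpus, the parameter grids, and the query string are all placeholder assumptions; sensible grids depend on the real data. The sketch also prints the uniform baseline 1 / n_classes next to the winning probability. The thread does not discuss this, but a probability that looks tiny can still be well above chance when there are many categories. The question's figures (about 2000 texts at 7 per category) suggest a few hundred categories, though that is only an inference.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; the real data has far more texts and categories.
texts = [
    "stocks bonds and dividends",
    "investing in index funds",
    "bank loans and interest rates",
    "guitar chords and scales",
    "violin strings and bows",
    "piano chords for beginners",
]
labels = ["finance", "finance", "finance", "music", "music", "music"]

# Step names produced by make_pipeline are the lowercased class names.
searches = {
    "MultinomialNB": GridSearchCV(
        make_pipeline(TfidfVectorizer(), MultinomialNB()),
        param_grid={"multinomialnb__alpha": [0.01, 0.1, 1.0]},
        cv=3,
    ),
    "LogisticRegression": GridSearchCV(
        make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000)),
        param_grid={"logisticregression__C": [0.1, 1.0, 10.0]},
        cv=3,
    ),
}

summary = {}
for name, search in searches.items():
    search.fit(texts, labels)  # refits the best pipeline on all the data
    probs = search.predict_proba(["interest on bank loans"])[0]
    summary[name] = {
        "best_params": search.best_params_,
        "max_prob": float(probs.max()),
        "chance": 1.0 / len(search.classes_),  # uniform baseline
    }

for name, info in summary.items():
    print(name, info)
```

Larger C means weaker regularization in LogisticRegression, while larger alpha means stronger smoothing in MultinomialNB, so the two grids run in opposite directions.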
