I am using scikit-learn's Multinomial Naive Bayes classifier for binary text classification (the classifier tells me whether a document belongs to category X or not). I train my model on a balanced dataset and evaluate it on a balanced test set, and the results are very promising.
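For reference, this is roughly what my training and evaluation setup looks like (the vectorizer choice, variable names and placeholder documents are just illustrative; in reality the corpora are loaded from disk and balanced 50/50):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

# Placeholder documents standing in for my real, balanced corpus.
train_texts = ["about topic x", "x related text", "unrelated news", "random chatter"]
y_train = [1, 1, 0, 0]   # 1 = belongs to category X, 0 = does not
test_texts = ["more text about x", "something else entirely"]
y_test = [1, 0]

# Bag-of-words features into a Multinomial Naive Bayes model.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, y_train)

print(classification_report(y_test, model.predict(test_texts)))
```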
This classifier needs to run in real time and constantly analyze documents thrown at it randomly.
However, when I run my classifier in production, the number of false positives is very high, so I end up with very low precision. The reason is simple: the classifier encounters far more negative samples in the real-time scenario (around 90% of the documents), which does not correspond to the idealized balanced dataset I used for training and testing.
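To make the effect concrete, here is the back-of-the-envelope calculation I did (the per-class rates are made-up numbers; only the ~90% negative prevalence matches what I see in production):

```python
# Hypothetical per-class rates for the trained classifier.
recall = 0.90   # fraction of true positives it catches
fpr = 0.10      # fraction of negatives it wrongly flags

def precision(prevalence_pos):
    """Precision as a function of the positive-class prevalence."""
    tp = recall * prevalence_pos
    fp = fpr * (1 - prevalence_pos)
    return tp / (tp + fp)

print(precision(0.5))   # balanced test set        -> 0.90
print(precision(0.1))   # production, 90% negative -> 0.50
```

So even with unchanged per-class behaviour, the precision I measure collapses once the class proportions shift.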
Is there a way I can simulate this real-time case during training, or are there any tricks I can use (including pre-processing the documents to check whether they are suitable for the classifier)?
I was planning to train my classifier on an imbalanced dataset with the same proportions as in the real-time case, but I am afraid that might bias Naive Bayes towards the negative class and cost me the recall I currently have on the positive class.
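Concretely, the plan would look something like this (the resampling helper, the 90/10 ratio and the placeholder documents are just a sketch of what I have in mind):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def resample_to_ratio(pos_texts, neg_texts, neg_fraction=0.9, seed=0):
    """Build a training set whose class proportions mimic production.

    Keeps all positives and draws negatives (with replacement if needed)
    until they make up `neg_fraction` of the result.
    """
    rng = np.random.default_rng(seed)
    n_neg = int(len(pos_texts) * neg_fraction / (1 - neg_fraction))
    sampled_neg = rng.choice(neg_texts, size=n_neg, replace=len(neg_texts) < n_neg)
    texts = list(pos_texts) + list(sampled_neg)
    labels = [1] * len(pos_texts) + [0] * n_neg
    return texts, labels

# Placeholder documents standing in for my real (currently balanced) corpus.
pos_texts = ["about topic x", "x related text"]
neg_texts = ["unrelated news", "random chatter", "something else"]

texts, labels = resample_to_ratio(pos_texts, neg_texts, neg_fraction=0.9)

# I also noticed MultinomialNB accepts a class_prior parameter, e.g.
# MultinomialNB(class_prior=[0.9, 0.1]), which seems to encode the same
# 90/10 prior without resampling, but I am not sure if that is the right knob.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
```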
Any advice is appreciated.