After training my own classifier with nltk, how do I load it in textblob?
The built-in classifier in textblob is pretty dumb. It's trained on movie reviews, so I created a huge set of examples in my context (57,000 stories, categorized as positive or negative) and then trained it using nltk. I tried using textblob to train it but it always failed:

from textblob.classifiers import NaiveBayesClassifier

with open('train.json', 'r') as fp:
    cl = NaiveBayesClassifier(fp, format="json")

That would run for hours and end in a memory error.

I looked at the source and found that textblob was just wrapping nltk, so I used nltk directly, and it worked.

The training set for nltk needs to be a list of tuples: the first element is a Counter of the words in the text and their frequency of appearance, and the second is the label, 'pos' or 'neg'.
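A minimal runnable sketch of that structure (the two-record data list here is made up for illustration; the real set had 57,000 stories):

```python
from collections import Counter

# Made-up records for illustration, in the same shape as the real data.
data = [
    {"text": "great story loved it", "label": "pos"},
    {"text": "terrible boring story", "label": "neg"},
]

# Each training item is (feature Counter, label).
train_set = [(Counter(d["text"].split()), d["label"]) for d in data]
# train_set[0] is (Counter({'great': 1, 'story': 1, 'loved': 1, 'it': 1}), 'pos')
```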

>>> import nltk
>>> from collections import Counter
>>> # data was loaded earlier from train.json
>>> train_set = [(Counter(i["text"].split()), i["label"]) for i in data[200:]]
>>> test_set = [(Counter(i["text"].split()), i["label"]) for i in data[:200]]  # withhold 200 examples for testing later

>>> cl = nltk.NaiveBayesClassifier.train(train_set)  # <-- the same classifier textblob was using

>>> print("Classifier accuracy percent:", nltk.classify.accuracy(cl, test_set) * 100)
Classifier accuracy percent: 66.5
>>> cl.show_most_informative_features(75)

Then I pickled it:

import pickle

with open('storybayes.pickle', 'wb') as f:
    pickle.dump(cl, f)

Now... I took this pickled file and re-opened it, getting back the nltk classifier (<class 'nltk.classify.naivebayes.NaiveBayesClassifier'>), and tried to feed it into textblob. Instead of

from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer
blob = TextBlob("I love this library", analyzer=NaiveBayesAnalyzer())

I tried:

blob = TextBlob("I love this library", analyzer=cl4)  # cl4 is the unpickled nltk classifier
Traceback (most recent call last):
  File "<pyshell#116>", line 1, in <module>
    blob = TextBlob("I love this library", analyzer=cl4)
  File "C:\python\lib\site-packages\textblob\blob.py", line 369, in __init__
    parser, classifier)
  File "C:\python\lib\site-packages\textblob\blob.py", line 323, in _initialize_models
    BaseSentimentAnalyzer, BaseBlob.analyzer)
  File "C:\python\lib\site-packages\textblob\blob.py", line 305, in _validated_param
    .format(name=name, cls=base_class_name))
    .format(name=name, cls=base_class_name))
ValueError: analyzer must be an instance of BaseSentimentAnalyzer

What now? I looked at the source: both are classes, but not quite the same.

Hypothesize answered 13/6, 2018 at 2:25 Comment(1)
textblob's classifier creates <class 'textblob.classifiers.NaiveBayesClassifier'> and nltk's classifier creates <class 'nltk.classify.naivebayes.NaiveBayesClassifier'> -- so I'm stuck on how to make nltk's work in textblob.Hypothesize
Another, more forward-looking solution is to use spaCy to build the model instead of textblob or nltk. It's new to me, but it seems a lot easier to use and more powerful: https://spacy.io/usage/spacy-101#section-lightning-tour

"spaCy is the Ruby on Rails of natural language processing."

import spacy
import random

nlp = spacy.load('en')  # load the pretrained starter model
train_data = [("Uber blew through $1 million", {'entities': [(0, 4, 'ORG')]})]  # NER training examples

with nlp.disable_pipes(*[pipe for pipe in nlp.pipe_names if pipe != 'ner']):
    optimizer = nlp.begin_training()
    for i in range(10):
        random.shuffle(train_data)
        for text, annotations in train_data:
            nlp.update([text], [annotations], sgd=optimizer)
nlp.to_disk('/model')
Hypothesize answered 13/7, 2018 at 18:46 Comment(1)
Update: one year later, I rely on Spacy for solving these types of problems, not nltk, in general.Hypothesize
I can't say for certain that an nltk corpus cannot work with textblob, and it would surprise me if it couldn't, since textblob imports nltk throughout its source code and is basically a wrapper.

But what I did conclude after many hours of testing is that nltk offers a better built-in sentiment analyzer called "vader" that outperformed all of my trained models.

import nltk
nltk.download('vader_lexicon')  # do this once: grab the trained lexicon from the web
from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
analyzer.polarity_scores("I find your lack of faith disturbing.")
# {'neg': 0.491, 'neu': 0.263, 'pos': 0.246, 'compound': -0.4215}
# CONCLUSION: NEGATIVE
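The NEGATIVE call above comes from the compound score. A common convention (documented with vader) is to threshold it at plus or minus 0.05; a minimal sketch:

```python
# Map vader's compound score to a label using the usual +/-0.05 thresholds.
def label_from_compound(compound):
    if compound >= 0.05:
        return 'pos'
    if compound <= -0.05:
        return 'neg'
    return 'neu'

label_from_compound(-0.4215)  # the compound score above -> 'neg'
```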

The vader_lexicon and the nltk code around it do a lot more parsing of negation language in sentences in order to flip the sentiment of positive words. When Darth Vader says "lack of faith", the negation changes the sentiment to its opposite.

I explained it here, with examples of the better results: https://chewychunks.wordpress.com/2018/06/19/sentiment-analysis-discovering-the-best-way-to-sort-positive-and-negative-feedback/

That replaces this textblob implementation:

from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer
TextBlob("I find your lack of faith disturbing.", analyzer=NaiveBayesAnalyzer()).sentiment
# {'neg': 0.182, 'pos': 0.817, 'combined': 0.635}
# CONCLUSION: POSITIVE

nltk also has additional documentation on using vader for sentiment analysis: http://www.nltk.org/howto/sentiment.html

TextBlob consistently crashed my computer with as few as 5,000 examples.

Hypothesize answered 21/6, 2018 at 17:42 Comment(3)
Thanks for the pointer to nltk and vader. In your link to nltk I don't see any info on how to customize the dictionary.Representation
I don't think I tried customizing Vader's lexicon. I simply used it in place of my custom sentiment models, because it worked better.Hypothesize
Gotcha. You said "The nltk classifier also has better documentation on how to use your own custom made corpus for sentiment analysis". However, I don't see that info on the page you linked to. I might edit that sentence to "Here are docs on nltk sentiment analysis"Representation
Going over the error message, it seems like the analyzer must inherit from the abstract class BaseSentimentAnalyzer. As mentioned in the docs here, this class must implement the analyze(text) function. However, while checking the docs of NLTK's implementation, I could not find this method in its main documentation here or in its parent class ClassifierI here. Hence, I believe these two implementations cannot be combined, unless you implement a new analyze function in NLTK's implementation to make it compatible with TextBlob's.
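For what it's worth, a rough sketch of what that adapter could look like: a hypothetical NltkWrapperAnalyzer that subclasses textblob's BaseSentimentAnalyzer and implements analyze() by delegating to the trained nltk classifier (with a stub fallback so the sketch runs even without textblob installed):

```python
from collections import Counter

try:
    from textblob.sentiments import BaseSentimentAnalyzer
except ImportError:
    # Stub so the sketch runs without textblob; the real base class does more.
    class BaseSentimentAnalyzer:
        def __init__(self):
            pass

class NltkWrapperAnalyzer(BaseSentimentAnalyzer):
    """Hypothetical adapter wrapping a trained nltk classifier for textblob."""

    def __init__(self, nltk_classifier):
        super().__init__()
        self._cl = nltk_classifier

    def analyze(self, text):
        # Featurize exactly as at training time: a Counter of whitespace tokens.
        return self._cl.classify(Counter(text.split()))
```

This is untested against TextBlob's internals; TextBlob may call other methods on the analyzer, so treat it as a starting point rather than a drop-in fix.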

Contemplative answered 20/6, 2018 at 9:38 Comment(2)
Thanks. I looked at all the code too, in nltk and textblob, to see how the classes were being used, but I didn't understand subclassing well enough to be sure.Hypothesize
I am not sure if one can combine the two with a temporary hack; if you manage something or have any updates, please post here :)Contemplative

© 2022 - 2024 — McMap. All rights reserved.