How to resample text (imbalanced groups) in a pipeline?

I'm doing text classification with MultinomialNB, but I'm running into problems because my data is heavily imbalanced. (The sample data below is simplified; my real dataset is much larger.) I'd like to over-sample the training data, ideally as a step inside this pipeline.

The pipeline below works fine without over-sampling, but my real data is too imbalanced to skip it.

With the current code, I keep getting this error: "TypeError: All intermediate steps should be transformers and implement fit and transform."

How do I build RandomOverSampler into this pipeline?

data = [['round red fruit that is sweet','apple'],['long yellow fruit with a peel','banana'],
    ['round green fruit that is soft and sweet','pear'], ['red fruit that is common', 'apple'],
    ['tiny fruits that grow in bunches','grapes'],['purple fruits', 'grapes'], ['yellow and long', 'banana'],
    ['round, small, green', 'grapes'], ['can be red, green, or purple', 'grapes'], ['tiny fruits', 'grapes'], 
    ['small fruits', 'grapes']]

df = pd.DataFrame(data,columns=['Description','Type'])  

X_train, X_test, y_train, y_test = train_test_split(df['Description'], df['Type'], random_state=0)
text_clf = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()), 
                    ('RUS', RandomOverSampler()),
                    ('clf', MultinomialNB())])
text_clf = text_clf.fit(X_train, y_train)
y_pred = text_clf.predict(X_test)

print('Score:',text_clf.score(X_test, y_test))
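
For what it's worth, that error message most likely comes from scikit-learn's own Pipeline, which expects every intermediate step to offer fit and transform. A sampler like RandomOverSampler offers fit_resample rather than transform, so it gets rejected. Below is a rough sketch of that kind of check. The two toy classes and the helper `is_valid_intermediate_step` are made up for illustration and are not the real library code.

```python
class ToyTransformer:
    """Hypothetical stand-in for a transformer such as TfidfTransformer."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


class ToySampler:
    """Hypothetical stand-in for a sampler such as RandomOverSampler."""

    def fit_resample(self, X, y):
        return X, y


def is_valid_intermediate_step(step):
    # Simplified assumption about the validation: an intermediate step
    # must be fittable and must also provide transform.
    can_fit = hasattr(step, "fit") or hasattr(step, "fit_transform")
    return can_fit and hasattr(step, "transform")


print(is_valid_intermediate_step(ToyTransformer()))
print(is_valid_intermediate_step(ToySampler()))
```

The transformer passes the check and the sampler does not, which is why a pipeline class that knows about samplers is needed.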

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline

data = [['round red fruit that is sweet','apple'],['long yellow fruit with a peel','banana'],
    ['round green fruit that is soft and sweet','pear'], ['red fruit that is common', 'apple'],
    ['tiny fruits that grow in bunches','grapes'],['purple fruits', 'grapes'], ['yellow and long', 'banana'],
    ['round, small, green', 'grapes'], ['can be red, green, or purple', 'grapes'], ['tiny fruits', 'grapes'],
    ['small fruits', 'grapes']]

df = pd.DataFrame(data, columns=['Description','Type'])

X_train, X_test, y_train, y_test = train_test_split(df['Description'], df['Type'], random_state=0)
text_clf = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('RUS', RandomOverSampler()),
                    ('clf', MultinomialNB())])
text_clf = text_clf.fit(X_train, y_train)
y_pred = text_clf.predict(X_test)

print('Score:',text_clf.score(X_test, y_test))
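
For intuition about what the over-sampling step does to the training data, here is a minimal pure-Python sketch of random over-sampling: rows from the smaller classes are duplicated at random until every class matches the largest one. The function `random_oversample` and its `seed` parameter are hypothetical names for this illustration and are not part of imbalanced-learn, which works on arrays and sparse matrices and has more options.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes
    have as many samples as the largest class (illustrative sketch)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())

    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        # Positions of the original rows that belong to this class.
        indices = [i for i, lab in enumerate(y) if lab == label]
        # Draw with replacement until the class reaches the target size.
        for _ in range(target - count):
            i = rng.choice(indices)
            X_out.append(X[i])
            y_out.append(label)
    return X_out, y_out


X = ["round red fruit", "tiny fruits", "purple fruits", "small fruits"]
y = ["apple", "grapes", "grapes", "grapes"]

X_res, y_res = random_oversample(X, y)
print(Counter(y_res))  # both classes now have 3 samples
```

In the imblearn pipeline the sampler step is applied only during fit, so the data passed to predict and score is not resampled.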
