How to resample text (imbalanced groups) in a pipeline?

I'm trying to do some text classification using MultinomialNB, but I'm running into problems because my data is unbalanced. (Below is some sample data for simplicity. In actuality, mine is much larger.) I'm trying to resample my data using over-sampling, and I would ideally like to build it into this pipeline.

The pipeline below works fine without over-sampling, but again, in real life my data requires it. It's very imbalanced.

With this current code, I keep getting the error: "TypeError: All intermediate steps should be transformers and implement fit and transform."

How do I build RandomOverSampler into this pipeline?

data = [['round red fruit that is sweet','apple'],['long yellow fruit with a peel','banana'],
    ['round green fruit that is soft and sweet','pear'], ['red fruit that is common', 'apple'],
    ['tiny fruits that grow in bunches','grapes'],['purple fruits', 'grapes'], ['yellow and long', 'banana'],
    ['round, small, green', 'grapes'], ['can be red, green, or purple', 'grapes'], ['tiny fruits', 'grapes'], 
    ['small fruits', 'grapes']]

df = pd.DataFrame(data,columns=['Description','Type'])  

X_train, X_test, y_train, y_test = train_test_split(df['Description'], df['Type'], random_state = 0)
text_clf = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()), 
                    ('RUS', RandomOverSampler()),
                    ('clf', MultinomialNB())])
text_clf = text_clf.fit(X_train, y_train)
y_pred = text_clf.predict(X_test)

print('Score:',text_clf.score(X_test, y_test))
Hamitosemitic answered 9/1, 2019 at 20:45

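sklearn's Pipeline requires every intermediate step to implement fit and transform, but RandomOverSampler is a sampler that implements fit_resample instead, hence the TypeError. Use the Pipeline from the imblearn package, not the one from sklearn: it accepts samplers as intermediate steps and applies them only during fit, so the test data is not resampled when you call predict or score. E.g., this code runs fine: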

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline


data = [['round red fruit that is sweet','apple'],['long yellow fruit with a peel','banana'],
    ['round green fruit that is soft and sweet','pear'], ['red fruit that is common', 'apple'],
    ['tiny fruits that grow in bunches','grapes'],['purple fruits', 'grapes'], ['yellow and long', 'banana'],
    ['round, small, green', 'grapes'], ['can be red, green, or purple', 'grapes'], ['tiny fruits', 'grapes'],
    ['small fruits', 'grapes']]

df = pd.DataFrame(data, columns=['Description','Type'])

X_train, X_test, y_train, y_test = train_test_split(df['Description'],
    df['Type'], random_state=0)

text_clf = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('RUS', RandomOverSampler()),
                    ('clf', MultinomialNB())])
text_clf = text_clf.fit(X_train, y_train)
y_pred = text_clf.predict(X_test)

print('Score:',text_clf.score(X_test, y_test))
Masqat answered 10/1, 2019 at 15:34
