SMOTE, Oversampling on text classification in Python
I am doing text classification and I have very imbalanced data:

Category | Total Records
Cate1    | 950
Cate2    |  40
Cate3    |  10

Now I want to oversample Cate2 and Cate3 so that each has at least 400-500 records. I prefer SMOTE over random oversampling. Code:

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

X_train, X_test, y_train, y_test = train_test_split(fewRecords['text'],
                                                    fewRecords['category'])

sm = SMOTE(random_state=12, sampling_strategy=1.0)
x_train_res, y_train_res = sm.fit_resample(X_train, y_train)

It does not work, as SMOTE can't generate synthetic samples from raw text. So I convert the text into vectors like this:

from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(fewRecords['text'])

# transform the training and validation data using count vectorizer object
xtrain_count = count_vect.transform(X_train)
ytrain_count = count_vect.transform(y_train)

I am not sure if this is the right approach, and how do I convert the vectors back to real text when I want to predict the real category after classification?

Purusha answered 23/6, 2018 at 9:0 Comment(1)
SMOTE will just create new synthetic samples from vectors. For that, you first have to convert your text to some numerical vector, and then use those numerical vectors to create new vectors with SMOTE. But using SMOTE for text classification doesn't usually help, because the numerical vectors created from text are very high-dimensional, and with SMOTE the results end up much the same as if you simply replicated the exact samples to oversample. – Tague
I know this question is over 2 years old and I hope you found a resolution. In case you are still interested, this can be done easily with imblearn pipelines.

I will proceed under the assumption that you will use an sklearn-compatible estimator to perform the classification, say Multinomial Naive Bayes.

Please note that I import Pipeline from imblearn, not sklearn:

from imblearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

Import SMOTE as you've done in your code:

from imblearn.over_sampling import SMOTE

Do the train-test split as you've done in your code:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    fewRecords['text'], fewRecords['category'],
    stratify=fewRecords['category'], random_state=0)

Create a pipeline with SMOTE as one of the components:

textclassifier = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('smote', SMOTE(random_state=12)),
    ('mnb', MultinomialNB(alpha=0.1))
])

Train the classifier on the training data:

textclassifier.fit(X_train, y_train)

Then you can use this classifier for any task, including evaluating the classifier itself, predicting new observations, etc.

For example, predicting a new sample:

textclassifier.predict(['sample text'])

would return a predicted category.

For a more accurate model, try word vectors as features or, more conveniently, perform hyperparameter optimization on the pipeline.

Sherlene answered 27/3, 2021 at 0:56 Comment(2)
Thanks @Sherlene, can you give details on why you use CountVectorizer and Tfidf in the same pipeline? I've never seen it before in this context of SMOTE. – Dictatorship
You could experiment with the TfidfTransformer removed. With some short text datasets I have seen improved performance with the above combination, but it really depends on the problem. – Sherlene

You first need to transform your text documents into fixed-length numerical vectors; then you can do anything you want. Try LDA or Doc2Vec.
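For instance, scikit-learn's own LDA implementation turns variable-length documents into fixed-length topic-proportion vectors; the tiny corpus and `n_components=2` below are purely illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["cats chase mice", "dogs chase cats",
        "stocks rose today", "markets fell sharply"]

# LDA works on term counts, not raw strings
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_vecs = lda.fit_transform(counts)

print(doc_vecs.shape)  # (4, 2): one fixed-length vector per document
```

Each row of `doc_vecs` is a topic distribution summing to 1, so the matrix can feed directly into SMOTE or a classifier.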

Faience answered 15/11, 2018 at 7:23 Comment(1)
Could you be more specific with that? I also want to upsample in a similar manner before using LDA. It seems I will get vectors, but need to make a matrix out of them for LDA. – Impel
