I am doing text classification and I have very imbalanced data, like:
Category | Total Records
Cate1 | 950
Cate2 | 40
Cate3 | 10
Now I want to oversample Cate2 and Cate3 so that each has at least 400-500 records. I prefer SMOTE over random oversampling. Code:
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

X_train, X_test, y_train, y_test = train_test_split(fewRecords['text'],
                                                    fewRecords['category'])

# 'auto' oversamples every minority class up to the majority class size
sm = SMOTE(random_state=12, sampling_strategy='auto')
x_train_res, y_train_res = sm.fit_resample(X_train, y_train)
It does not work, because SMOTE cannot generate synthetic samples from the raw text. Now, when I convert the text into vectors like this:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(fewRecords['text'])

# transform the training and validation data using the count vectorizer object
xtrain_count = count_vect.transform(X_train)
ytrain_train = count_vect.transform(y_train)
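
Is the right order instead to vectorize first and then run SMOTE on the numeric features? Here is a sketch of what I think that would look like; xtrain_count comes from the vectorizer above, while the LabelEncoder and the lowered k_neighbors value are my own assumptions, not code I have verified:

from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import SMOTE

# encode the category names as integers so SMOTE gets numeric labels
label_enc = LabelEncoder()
y_train_enc = label_enc.fit_transform(y_train)

# oversample the sparse count matrix, not the raw strings;
# k_neighbors is lowered because Cate3 only has about 10 rows
sm = SMOTE(random_state=12, k_neighbors=3)
x_train_res, y_train_res = sm.fit_resample(xtrain_count, y_train_enc)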
I am not sure if this is the right approach, and I do not know how to convert the vectors back to the real categories when I want to predict the actual category after classification.
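
For getting the real category back after prediction, is the idea simply to invert the label encoding? Something like the following sketch, where MultinomialNB is only a placeholder classifier I have not settled on:

from sklearn.naive_bayes import MultinomialNB

# placeholder classifier, just to illustrate the round trip
clf = MultinomialNB()
clf.fit(x_train_res, y_train_res)

# unseen text goes through the same fitted vectorizer ...
xtest_count = count_vect.transform(X_test)
pred_enc = clf.predict(xtest_count)

# ... and the encoded predictions map back to the original category names
pred_categories = label_enc.inverse_transform(pred_enc)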