SMOTE oversampling and cross-validation
I am working on a binary classification problem in Weka with a highly imbalanced data set (90% in one category, 10% in the other). I first applied SMOTE (http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume16/chawla02a-html/node6.html) to the entire data set to even out the categories, and then performed 10-fold cross-validation on the newly obtained data. I got (overly?) optimistic results, with an F1 score around 90%.

Is this due to oversampling? Is it bad practice to perform cross-validation on data on which SMOTE is applied? Are there any ways to solve this problem?

Cartel answered 6/8, 2015 at 12:52 Comment(0)

I think you should split the data into training and test sets first, then apply SMOTE only to the training part, and then test the algorithm on the portion of the data set that contains no synthetic examples. That will give you a better picture of the algorithm's real performance.
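To make the "split first, then oversample" order concrete, here is a minimal pure-Python sketch. The `smote` helper below is a simplified stand-in for Weka's SMOTE filter, not its actual implementation: it creates synthetic minority samples by interpolating between a minority point and one of its k nearest minority neighbours, and it is meant to be called only on the training portion, after the split.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic samples from the minority class by
    interpolating between a random minority point and one of its
    k nearest minority neighbours (simplified SMOTE sketch)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x within the minority class
        # (by squared Euclidean distance; x itself is excluded).
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random point on the segment between x and nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Correct order: split first, then oversample the training part only.
# The held-out test points are never mixed with synthetic ones.
minority_train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
augmented_train = minority_train + smote(minority_train, n_new=4, k=2)
```
Because each synthetic point lies on the line segment between two real minority points, evaluating on untouched test data avoids the leakage that makes SMOTE-before-CV scores look inflated.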

Lyonnais answered 28/8, 2015 at 14:39 Comment(1)
You can't imagine how many people do this wrong... I totally agree with you. – Dithyrambic

In my experience, dividing the data set by hand once is not a good way to deal with this problem. When you have one data set, you should cross-validate each classifier so that in every iteration one fold serves as the test set, on which you must not apply SMOTE, while the other nine folds form the training set, which you balance with SMOTE. Repeat this in a loop for all ten folds. This gives a more reliable result than a single hand-made split of the whole data set.

It is obvious that if you apply SMOTE to both the test and training sets, you end up testing on synthesized data, which yields a high accuracy that is not actually meaningful.
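The fold-by-fold procedure described above can be sketched as follows. This is a pure-Python sketch, not Weka code; `oversample` and `train_and_evaluate` are hypothetical callables standing in for the SMOTE step and for your classifier's train/score cycle.

```python
import random

def cv_folds(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and split them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(data, labels, oversample, train_and_evaluate, k=10):
    """k-fold cross-validation where oversampling is applied to the
    training folds only; the held-out fold is left untouched, so no
    synthetic example is ever part of the test set."""
    scores = []
    for fold in cv_folds(len(data), k):
        test_set = set(fold)
        train_idx = [i for i in range(len(data)) if i not in test_set]
        X_tr = [data[i] for i in train_idx]
        y_tr = [labels[i] for i in train_idx]
        X_tr, y_tr = oversample(X_tr, y_tr)  # SMOTE on the training part only
        X_te = [data[i] for i in fold]       # original, untouched test fold
        y_te = [labels[i] for i in fold]
        scores.append(train_and_evaluate(X_tr, y_tr, X_te, y_te))
    return scores
```
The key design point is that `oversample` is called inside the loop, after the fold split, so the balancing is redone for each training set and the test fold never sees synthetic data.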

Megalomania answered 8/12, 2015 at 11:23 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.